Abstract. When network and graph theory are used in the study of complex systems, a typically finite set of nodes of
the network under consideration is frequently either explicitly or implicitly considered representative of a much larger 
finite or infinite region or set of objects of interest. The selection procedure, e. g., formation of a subset or some kind of 
discretization or aggregation, typically results in individual nodes of the studied network representing quite differently 
sized parts of the domain of interest. This heterogeneity may induce substantial bias and artifacts in derived network 
statistics. To avoid this bias, we propose an axiomatic scheme based on the idea of node splitting invariance to derive 
consistently weighted variants of various commonly used statistical network measures. The practical relevance and 
applicability of our approach is demonstrated for a number of example networks from different fields of research, and 
is shown to be of fundamental importance in particular in the study of spatially embedded functional networks derived 
from time series as studied in, e. g., neuroscience and climatology. 



1 Introduction 

1.1 Motivation 

In the last decades, network and graph theory have successfully 
been applied to various kinds of complex systems, and many
different measures have been defined to study their structural 
and topological properties. Most of these are of a combinatorial 
nature, based on counts of certain nodes, links, triangles, paths, 
etc. (for an overview, see, e. g., [1-6]).

Often, a typically finite set of nodes of the studied network 
is either explicitly or implicitly considered representative of a 
much larger finite or infinite set of objects of interest (which we 
will call the domain of interest or in sampling contexts the pop- 
ulation), either by being a somehow selected or sampled subset 
of this larger set or, more often, by constituting some kind of 
discretization, aggregation, or coarse-graining of it. Typical ex- 
amples are networks of 

1. functional connections between differently sized regions of
interest (ROIs) in the human brain, as in [7-9] and Fig. 1,

2. dynamical couplings or statistical associations between
time series measured at discrete regular grid points, in irregular
mesh cells, or at otherwise sampled discrete locations
on some manifold (e. g., a climate network either using
a latitude-longitude-regular grid on the Earth's surface,
as in [10,11], or with meteorological stations as nodes),

3. routing connections between autonomous systems (AS's) in
the internet, representing groups of individual servers, users
or ranges of IP addresses, as in [12-15],

4. cross-references between articles containing different
amounts of content in an online encyclopedia, as in [16,17],

5. social relationships between households consisting of different
numbers of individuals, as in [18,19],

6. trade relationships between countries with differing gross
domestic product and representing different numbers of
consumers, as in [20] and Fig. 2,

7. proximities between sampled state vectors in the (reconstructed)
phase space of a dynamical system [21-23], sampled
at irregular points in time.

Depending on the meshing, sampling, or parcellation method, 
the chosen level of aggregation or description, and the avail- 
ability of data, some parts of the domain of interest might be 
represented by relatively more nodes than others, e. g.: the polar 
or densely populated regions on the Earth's surface in a climate 
network (since grid points cluster at the poles and meteorolog- 
ical stations cluster in populated areas); the subcortical area in 
the human brain (when this is parcellated into smaller regions); 
the younger AS's in the internet (usually having below-average 
numbers of users); the more technical subjects in the encyclo- 
pedia (usually being organized into shorter articles); the child- 
less population of a village (having a higher ratio of households 
per people); in the trade network the industrialized world when 
the interest is in consumers (consisting of more countries per 
population) or the non-industrialized world when the interest 
is in GDP (consisting of more countries per GDP); the more 
densely sampled time periods in the dynamical system. Often, 
this problem of over-representing some parts of the domain of 
interest is directly related to the size distribution of the objects
chosen as nodes. That distribution is frequently heavy-tailed, 
e. g., for AS's in the internet, articles in the encyclopedia, and 
countries in the trade network. 

The described representation bias can cause serious pitfalls 
in the interpretation of results obtained from the selected net- 
work, if one wants to make inferences about structural and 
topological properties of the domain of interest (i. e., in the 
above examples, about all locations on the globe or in the brain, 
all users of the internet, all units of content in the encyclope- 
dia, all persons in the village, all consumers, or all time points 
on the trajectory, respectively). From both a statistical and an 
approximation-theoretical point of view it is therefore impor- 
tant to first decide which structural or topological properties of 
the domain of interest we are interested in (e. g., the connectiv- 
ity distribution or amount of clustering), and then to determine 
what measure in the selected network could be used as a good 
estimate or approximation of these properties of the underly- 
ing domain of interest. Often, the network construction (that is, 
the choice of nodes and links) also involves some parameters 
like sampling density, grid origin, orientation and size, mesh 
size, or link inclusion thresholds, and there are often system- 
atic influences of these parameters on the results of any mea- 
surements in the resulting network. This can lead to selective 
bias, as in the case of the dorsal cingulate gyrus in the brain 
network, whose betweenness value (a popular measure of node
importance) depends very much on whether it is treated as one
node DCG or as two nodes DCG.L and DCG.R (see Sec. 5.3).
The effect can even lead to completely artificial features, as in
the climate network depicted in the centre of Fig. 3(A, B).

Because of the above observations, our purpose in this pa- 
per is to improve the estimation or approximation power of 
common network measures by introducing into their calcula- 
tion a suitable kind of weighting of all individual nodes. For 
this, we use node aggregation weights based on, e. g., ROI 
volume, inverse grid density, mesh cell size, inverse sampling 
density, IP address ranges of AS's, article length, household 
size, or a country's population or GDP. A suitable choice of 
weights is sometimes difficult (e. g., for AS's or countries) and 
the weights may have to be estimated (e. g., in case of sampled 
state vectors). Our focus in this paper, however, is not on the 
derivation of suitable weights, but on how to make proper use 
of them once they are given. 

To avoid confusion, we emphasize that there exists a theory 
of "weighted networks" [25] in which links instead of nodes
have weights representing quantities like length or capacity. 
But that kind of weights and the related theory is of little help 
here since the type of situation we are concerned with calls for 
node weights instead, and we will see that the corresponding 
measures differ from those in the other theory, even if the node 
weights are somehow translated into link weights. Actually, 
some real-world networks (like the world trade network) are
probably best described as having both node and link weights,
as well as being directed. In this paper, we are treating the
case of undirected networks with node weights, but the same
methodology can easily be applied to transform network measures
that already make use of link weights or link direction
into node-weighted versions that use node weights as well.

Fig. 2. (Colour online) World trade network of significant trade relations in 2009 (spring model layouts, country codes according to ISO 3166).
Disk area is proportional to 2008 GDP (A) or population (B,C). Node colour indicates three-group solutions of Newman's [24] modularity-based
partition algorithm (see Appendix B, Modularity). The node-weighted (n. s. i.) version using GDP (A) and the unweighted version (B)
give almost identical groups, whereas the n. s. i. version using population (C) differs considerably, producing more equally populated regions.

Fig. 3. (Colour online) Comparison of unweighted and weighted
(n. s. i.) versions of degree (A,C) and clustering coefficient (B,D) in
the northern polar region (Lambert equal area projection) of a global
climate network representing correlations in temperature dynamics.
The high values at the pole in (A,B) turn out to be an artefact of the
increasing grid density toward the pole, as demonstrated by (C,D).



1.2 Climate networks 

To exemplify our approach, assume that the domain of inter- 
est is the set of all points on the Earth's surface and the sig- 
nificant linear correlations between the surface air temperature 
time series at pairs of such points. This can be interpreted as a 
"network" (mathematically, an infinite simple graph) with un- 
countably many nodes and (unknown) links. However, we only 
observe data for a finite subset of points, say those 64,082 regular
grid points that have integer latitude and longitude degrees
(which is a fairly common grid type in Earth sciences). Then
the set of significant linear correlations between the temperature
time series at these sampled points defines a (finite) climate
network [10,11,26-31] whose properties, as measured by
common network statistics, are somehow hoped to be represen- 
tative of related properties of the temperature dynamics on the 
whole globe. E. g., the degree of a node in the finite network 
corresponds to how much surface area this point is correlated 
to in the whole domain of interest. When degree is computed 
in the standard way, however, the resulting large regional dif- 
ferences will mostly reflect the strongly differing amounts of 
surface area each node represents (small area per node at the 
poles, large area per node on the equator), instead of indicating 
"real" regional differences in the connectivity of the underlying 
domain of interest (the temperature field). 

The significance of observed features can often be assessed
by comparing results with those obtainable in a similar "benchmark"
network in which the links have been replaced by a
spatially homogeneous link distribution, which was done, e. g.,
in [32] to show how the underlying geometry of a cortical network
influences network statistics. Similarly, for our climate
network, a perfectly homogeneous link distribution in the domain
of interest (like the one in which two points are linked iff
their angular distance is less than five degrees) would lead to
regional differences of the degree distribution in the grid-based
network, just because its node density and therefore its link
density increase towards the poles (Fig. 4, dashed lines). For
the same reason, the standard local clustering coefficient shows
an artificial increase towards the poles. Fig. 3(A, B) shows such
artifacts in a real-world climate network similar to that of [11].
We will see below that this effect can be avoided by using a
node weight proportional to the inverse node density (= cosine
of latitude), by using the sum of all neighbours' weights as
a measure of degree (this measure being called area-weighted
connectivity in [26]), and by using the weighted proportion of
interlinked pairs of neighbours of a node as a measure of clustering,
instead of the classical degree measure and clustering coefficient.

Figure 3(C, D) shows the meaningful regional differences
in the example real-world climate network that remain after
the artifacts have been removed in this way. In that example,
we can check the validity of this by comparing these results
to those obtained from a climate network based not on
a latitude-longitude grid but on a "geodesic" grid that has an
approximately homogeneous node density all over the globe
(see also [33], results not shown here for brevity). Using the
classical degree and clustering measures in the latter, homogeneously
sampled network gives results almost identical to those
in Fig. 3(C, D) rather than (A, B). Such a change of grid, however,
usually requires some kind of interpolation of the available
data, which can introduce other problems and is obviously
only possible for specific kinds of network constructions.
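
The grid artifact described above can be reproduced in a few lines. The following is a minimal sketch, not the construction used for the figures: it assumes a hypothetical coarse grid with 10-degree spacing (poles omitted), a homogeneous benchmark in which two grid points are linked iff their angular distance is below 25 degrees (the five-degree rule of the text, scaled to the coarser grid), and node weights equal to the cosine of latitude. All function and variable names are illustrative.

```python
import math

def unit_vector(lat_deg, lon_deg):
    """Cartesian unit vector of a point on the sphere."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

# Hypothetical coarse latitude-longitude grid (10-degree spacing, no poles).
nodes = [(lat, lon) for lat in range(-80, 90, 10) for lon in range(0, 360, 10)]
vectors = [unit_vector(lat, lon) for lat, lon in nodes]

# Node weight proportional to the inverse node density: cos(latitude).
weights = [math.cos(math.radians(lat)) for lat, _ in nodes]

# Homogeneous benchmark: link iff angular distance < 25 degrees (assumed value).
cos_threshold = math.cos(math.radians(25.0))

def neighbours(v):
    """Indices of all nodes linked to node v in the benchmark network."""
    xv, yv, zv = vectors[v]
    return [i for i, (x, y, z) in enumerate(vectors)
            if i != v and x * xv + y * yv + z * zv > cos_threshold]

def degree(v):
    """Classical degree: number of neighbours."""
    return len(neighbours(v))

def area_weighted_connectivity(v):
    """Sum of the neighbours' weights."""
    return sum(weights[i] for i in neighbours(v))

equator = nodes.index((0, 0))
polar = nodes.index((80, 0))
print(degree(equator), degree(polar))
print(area_weighted_connectivity(equator), area_weighted_connectivity(polar))
```

In this sketch the classical degree of the near-polar node is several times that of the equatorial node, although the link rule is the same everywhere, whereas the weighted sums of the two nodes are of similar size.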



1.3 Outline 

Surprisingly, to the best of our knowledge, the by now vast literature
on complex networks contains almost no node-weighted
measures, except for the above-mentioned area-weighted connectivity
measure, whereas other techniques (e. g., finite elements)
and methods of data analysis (e. g., empirical orthogonal
functions, a weighted version of principal components
analysis [34]) often use some weighting or other adjustment
to avoid similar biases or artifacts.

In this paper, we therefore present a fairly general strategy
for deriving node-weighted versions of network measures that
can be expected to give estimates or approximations of properties
of the domain of interest that are in a certain sense consistent
whenever the network is of the type in which the links can
be interpreted as indicating some kind of similarity or "close"
relationship, as is more or less the case in all the cited examples.
The approach will not be directly useful for other kinds of
networks, e. g., if links represent some kind of "complementarity"
instead of similarity, as in most bipartite networks. Also,
it will not be applicable when only the network as a whole can
be considered "representative" of the domain of interest, while
individual nodes cannot be considered representative of
some well-defined part of it, as will often be the case when network
sampling methods such as random node or link sampling
or snowball sampling are used (related estimation problems are
treated in [6,35]).

We then apply this strategy to many of the commonly used 
network measures and illustrate the effects in a number of ex- 
ample networks from the above list. Fortunately, there is a quite 
simple pragmatic approach to finding useful weighted versions 
of network measures which allows us to postpone a more de- 
tailed analysis of the statistical estimation or numerical ap- 
proximation properties for further research. This approach is 
axiomatic rather than analytic in that it requires our measures 
to fulfil an easily verified condition of node splitting (or twin 
merging) invariance. 

After stating preliminary matter in Sec. 2 and giving more
details on our illustrative example networks (Sec. 3), we will
introduce this concept formally and briefly relate it to a statistical
and approximation interpretation in Sec. 4. We then proceed
with presenting a comprehensive set of corresponding network
measures in Secs. 5 and 6, illustrating each one's effect in those
example networks for which the respective network measure has
been considered important in the literature, but not aiming at
analysing each example network with the full set of measures.
We end with a more detailed description of an application to
climate networks in Sec. 7 and a conclusion (Sec. 8). In two
Appendices (online), we present some additional measures,
give versions of the new measures that allow for a simpler comparison
with their unweighted counterparts, and briefly discuss
how the related parameter of typical weight can be estimated.



2 Preliminaries 

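Let $G = (\mathcal{N}, \mathcal{E})$ denote a finite undirected simple graph (the
network under consideration) with known node or vertex set
$\mathcal{N}$, edge or link set $\mathcal{E} \subseteq \{\{i,j\} : i \neq j \in \mathcal{N}\}$, and adjacency
matrix $A = (a_{ij})_{i,j \in \mathcal{N}}$, where $a_{ij} \in \{0,1\}$, and $a_{ij} = 1$ iff
$\{i,j\} \in \mathcal{E}$. For simplicity, we assume that $\mathcal{N} = \{1, \dots, N\}$ for
some natural number $N > 1$. The neighbours of a node $v \in \mathcal{N}$,
i. e., the members of $v$'s (punctured) neighbourhood

$$\mathcal{N}_v = \{i \in \mathcal{N} : a_{iv} = 1\} = \{i \in \mathcal{N} : a_{vi} = 1\}, \tag{1}$$

are those nodes that $v$ is directly linked to by an edge. We will
also use the extended adjacency matrix $A^+ = (a^+_{ij})_{i,j \in \mathcal{N}} = A + I$
with

$$a^+_{ij} = a_{ij} + \delta_{ij}, \tag{2}$$

where $I = (\delta_{ij})_{i,j \in \mathcal{N}}$ is the identity matrix and $\delta$ the Kronecker
symbol, with $\delta_{ij} = 1$ for $i = j$ and $\delta_{ij} = 0$ for $i \neq j$. Moreover,
we will need the extended or unpunctured neighbourhood

$$\mathcal{N}^+_v = \{i \in \mathcal{N} : a^+_{iv} = 1\} = \mathcal{N}_v \cup \{v\}. \tag{3}$$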

Note that in classical mathematical topology, the term "neighbour- 
hood of a point" implies that the point itself is included, otherwise one 
speaks of a "punctured" neighbourhood. In the network literature, the 
term "neighbour" also sometimes refers to nodes not directly linked. 
In our terminology, a node is not a neighbour of itself but still a mem- 
ber of its unpunctured neighbourhood. 
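
The objects introduced in this section translate directly into code. The following minimal sketch uses a hypothetical four-node example graph and 0-based node labels (the text uses labels 1 to N); the function names are illustrative only.

```python
# Hypothetical example graph: undirected links given as unordered pairs.
N = 4
links = {(0, 1), (1, 2), (1, 3), (2, 3)}

# Adjacency matrix A: a[i][j] = 1 iff nodes i and j are linked.
A = [[0] * N for _ in range(N)]
for i, j in links:
    A[i][j] = A[j][i] = 1

# Extended adjacency matrix A+ = A + I (every node is "linked" to itself).
A_plus = [[A[i][j] + (1 if i == j else 0) for j in range(N)] for i in range(N)]

def neighbourhood(v):
    """Punctured neighbourhood: all nodes directly linked to v."""
    return {i for i in range(N) if A[i][v] == 1}

def extended_neighbourhood(v):
    """Unpunctured neighbourhood: the neighbours of v together with v itself."""
    return {i for i in range(N) if A_plus[i][v] == 1}

print(neighbourhood(1), extended_neighbourhood(1))
```

The only difference between the two neighbourhoods is the node itself, which is what the diagonal entries of the extended adjacency matrix encode.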



Jobst Heitzig et al.: Node-weighted measures for complex networks with spatially embedded, sampled, or differently sized nodes






In addition, we will assume that each node $v$ is assigned
a positive real-valued (aggregation) weight $w_v$. As many of
our measures will involve unweighted or weighted means over
nodes or pairs of nodes, we also introduce the total weight

$$W = \sum_{i \in \mathcal{N}} w_i \tag{4}$$

and a shorthand notation for (weighted) averages of functions
of nodes or node pairs:

$$\langle h(i,j) \rangle_{ij} = \frac{1}{N^2} \sum_{i \in \mathcal{N}} \sum_{j \in \mathcal{N}} h(i,j), \quad \text{and} \quad
\langle h(i,j) \rangle^w_{ij} = \frac{1}{W^2} \sum_{i \in \mathcal{N}} \sum_{j \in \mathcal{N}} w_i \, h(i,j) \, w_j. \tag{5}$$

Most of the remaining notation will follow the reviews of
Newman [1], Boccaletti [2], and da F. Costa [3]. In cases where
a measure is commonly normalized using a factor of $1/(N-1)$,
$1/((N-1)(N-2))$, etc., we will instead use the factors $1/N$,
$1/N^2$, etc., to keep things simpler.
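
As an illustration of the averaging notation, the following sketch computes the total weight and the unweighted and weighted averages of a function of node pairs. It assumes the normalisations 1/N^2 and 1/W^2 for pair averages; the three-node graph, the weights, and all names are hypothetical.

```python
def total_weight(w):
    """Total weight W: the sum of all node weights."""
    return sum(w)

def pair_average(h, n):
    """Unweighted average of h(i, j) over all ordered node pairs."""
    return sum(h(i, j) for i in range(n) for j in range(n)) / n ** 2

def weighted_pair_average(h, w):
    """Weighted average of h(i, j), each pair weighted by w[i] * w[j]."""
    n, W = len(w), total_weight(w)
    return sum(w[i] * h(i, j) * w[j] for i in range(n) for j in range(n)) / W ** 2

# Hypothetical example: a path 0 - 1 - 2 whose last node is twice as "large".
adjacency = [[0, 1, 0],
             [1, 0, 1],
             [0, 1, 0]]
weights = [1.0, 1.0, 2.0]

def is_linked(i, j):
    return adjacency[i][j]

print(pair_average(is_linked, 3))
print(weighted_pair_average(is_linked, weights))
```

With all weights equal, the two averages coincide; with unequal weights, pairs involving large nodes count for more.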



3 Examples of networks with nodes of 
different size 

As stated in previous sections, node weights are ubiquitous
in graph representations of complex systems. We will show
applications of our weighted measures to various example
networks, ranging from the human brain and the internet
to Wikipedia and world trade. First, we describe the details of the
constructed weighted networks. The focus application to climate
networks will be presented separately in Sec. 7.



3.1 Human brain 

Functional Magnetic Resonance Imaging (fMRI) time series
have been widely used to study neural activities in the brain
from a network perspective. A node is represented by a cortical
region of interest, while a link is often characterized
by some statistical association measuring the correlation between
different regions (e. g., linear Pearson correlation, nonlinear
mutual information, or frequency-dependent correlation
by wavelets [7]). We consider two regions to be connected
if their correlation exceeds a threshold, which can be based
either on the correlation values or on a probability under
some appropriate null hypothesis. The resulting functional
connectivity of nervous systems has been shown to display
high clustering and short path length, which confers a capability
for both specialized or modular processing in local neighbourhoods
and distributed or integrated processing over the entire
network [7]. The cerebral cortex is a thin folded sheet
tightly confined by the skull and is thus an archetypal example
of a complex network that is strongly constrained by geometry.
The understanding of the properties captured by a variety of
network measures (e. g., high clustering, short path length, motifs,
and modularity) has been pointed out to be very limited
because the role of the spatial geometry has been largely
underestimated [32]. Many anatomical features show distance-dependent
properties, e. g., the density of corticocortical neural
connections, volume, processing steps, signal travel times, and
genetic encoding needed to specify connectivity.

We re-examine a version of the network of functional connections
between differently sized regions of interest (ROIs)
in the human brain as it was described in [7]. After some appropriate
preprocessing of the acquired fMRI data, the network
consists of 90 cortical and subcortical time series extracted
from each individual. The resulting graph is shown in Fig. 1,
where the weight of a node is given by the volume of its
associated ROI.



3.2 Internet 

A well-known type of internet mapping is obtained by considering
so-called autonomous systems (AS's). On the AS level,
each node represents an AS while each link between two nodes
represents the existence of a peer connection among the corresponding
AS's in Border Gateway Protocol (BGP) routing
tables. Traditionally, much interest is in identifying a possible
power-law decay of $P(k_v > x)$ in $x$ for the degree distribution
[12,13,15]. Power laws have been reported by different studies
of AS maps, and their exponents seem to be stable over a
number of years [12,15], which could help to devise a novel
class of dynamical models of the internet. However, most studies
use BGP data collected by the Oregon route views project
only, which may provide an incomplete picture of the internet
connectivity [14].

Whereas in the literature each AS is usually treated
the same despite the considerable differences in size (three
orders of magnitude), we associate here with each AS a
node weight proportional to the size of the IP address space
allocated to that AS in terms of Classless Inter-Domain
Routing (CIDR) prefixes, a common measure of network size
that can be used as an approximation of the fraction of the
internet represented by that AS (although better measures
might be possible). In other words, we consider the set of
all IP addresses allocated by CIDR as our domain of interest
$G_0$, which we study by means of a network $G$ of AS's that
each represent a certain part of $G_0$. To construct our network,
we used the January 2010 BGP routing table snapshot
(http://archive.routeviews.org/oix-route-views/2010.01/oix-full-snapshot-2010-01-27-1200.bz2)
from the Oregon route views project, and a corresponding
CIDR prefix allocation snapshot from
(http://www.cidr-report.org/as2.0/aggr.html),
giving a network of over 30,000 nodes.
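
How node weights might be derived from CIDR prefixes can be sketched with Python's standard `ipaddress` module. This is only an illustration under stated assumptions, not the processing used for the network described above: the (AS number, prefix) pairs are hypothetical and use documentation address ranges and private-use AS numbers, and overlapping prefixes are not deduplicated.

```python
import ipaddress
from collections import defaultdict

# Hypothetical (AS number, CIDR prefix) allocations, for illustration only.
allocations = [
    (64512, "192.0.2.0/24"),
    (64512, "198.51.100.0/25"),
    (64513, "203.0.113.0/24"),
]

def address_space_weights(allocations):
    """Node weight of each AS: its share of all allocated IP addresses."""
    size = defaultdict(int)
    for asn, prefix in allocations:
        # num_addresses is the number of addresses covered by the prefix.
        size[asn] += ipaddress.ip_network(prefix).num_addresses
    total = sum(size.values())
    return {asn: count / total for asn, count in size.items()}

weights = address_space_weights(allocations)
print(weights)
```

Normalising by the total is optional, since only the ratios between node weights matter for a weight that is defined up to proportionality.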



3.3 Wikipedia 

Wikipedia is an intriguing research object from a sociologist's 
point of view: nodes are articles which are published by a num- 
ber of independent individuals in various languages, edges are 
reference hyperlinks which cover topics they consider relevant. 
Several models for the growth of Wikipedia have been proposed
to mimic the plausible preferential attachment mechanisms
that might explain the apparent scale-freeness of the resulting
networks [16,17].

Fig. 4. Comparison of unweighted and n. s. i. versions of four common local network measures (see Sec. 5 for definitions) in a real-world
climate network (CN) and a benchmark network (BN). (A) degree $k_v$, $k_v^*$, (B) clustering coefficient $C_v$, $C_v^*$, (C) closeness centrality $CC_v$, $CC_v^*$,
and (D) Newman's random walk betweenness $NB_v$, $NB_v^*$. Measures are averaged along bands of equal latitude and plotted against latitude.
Real-world global climate network representing correlations in surface air temperature dynamics. Benchmark network defined on the same
grid with independent link probabilities depending on distance alone. The benchmark lines show that the observed increase in the unweighted
degree and clustering coefficient near the poles at ±90° latitude is mainly due to the vanishing node weight of cos(latitude), whereas the effect
on closeness centrality is much smaller. In case of Newman's random walk betweenness, the slight increase of the unweighted version towards
the equator in the benchmark network reflects the fact that those nodes represent larger surface areas, hence a random walk on the globe will
cross this area more often. This explains in part why in the unweighted version in the real-world network the central peaks are more prominent
than in the n. s. i. version.



As another example network, we used as nodes all 33,359
articles containing the word "physics" from the 30 July 2010
snapshot of the English language version of Wikipedia, and
made an undirected link between two articles if either references
the other, resulting in an average degree of 31.9. Other
authors study directed Wikipedia networks [16,17], but treating
it as an undirected network can be partially justified by the
fact that in Wikipedia one can follow references backwards using
the function "What links here". Since the individual articles
represent quite different amounts of text (between one and
283 kB), it is straightforward to use their size in characters as
the node weights $w_v$.



3.4 World trade 

Finally, we also consider for illustration a network of countries
where two countries are linked when they trade considerable
amounts (similar to [20]), e. g., if the total value of their mutual
reported imports and exports in 2009 accounted for at least
10% of the total reported foreign trade value of at least one of
the two countries, based on data from comtrade.un.org. Such
a network is shown in Fig. 2. The topological characterization
of the world trade web (WTW) is of primary interest for the
modeling of crisis propagation at the global level, and it has
been reported in [20] that the unweighted WTW displays some
typical properties of complex networks, i. e., scale-free degree
distribution, small-world properties, and high clustering coefficients.
As node weights, we use either population or gross domestic
product in 2008 (as reported by the IMF), both showing
considerable differences.

A much more realistic model of the world trade network 
would of course use weighted and directed links representing 
actual imports and exports, in addition to node weights, so one 
cannot attach much real-world importance to the exemplary re- 
sults that we will present here for illustration with this simpli- 
fied network. 
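
The link rule of this simplified trade network can be written down compactly. The sketch below assumes that "total reported foreign trade value" is the sum of a country's bilateral values over all partners in the data; the three countries and their trade values are hypothetical, and all names are illustrative.

```python
from collections import defaultdict

# Hypothetical bilateral trade values (mutual imports plus exports), arbitrary units.
bilateral_trade = {("A", "B"): 50.0, ("A", "C"): 2.0, ("B", "C"): 30.0}

def total_foreign_trade(bilateral):
    """Total trade value of each country, summed over all its partners."""
    totals = defaultdict(float)
    for (a, b), value in bilateral.items():
        totals[a] += value
        totals[b] += value
    return dict(totals)

def trade_links(bilateral, min_share=0.10):
    """Link two countries iff their bilateral trade reaches min_share of the
    total foreign trade of at least one of the two."""
    totals = total_foreign_trade(bilateral)
    links = set()
    for (a, b), value in bilateral.items():
        # "At least one of the two" is decided by the smaller total.
        if value >= min_share * min(totals[a], totals[b]):
            links.add(frozenset((a, b)))
    return links

links = trade_links(bilateral_trade)
print(links)
```

Because the criterion only has to hold for one of the two partners, a small economy can be linked to a large one even if the trade is negligible from the large economy's point of view.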



4 Approaches to node weighting in network 
measurements 

4.1 Statistical interpretation 

In many cases the nodes $\mathcal{N}$ and links $\mathcal{E}$ of the studied network
$G$ are simply a subset from a larger (maybe infinite) network
$G_0$ with nodes $\mathcal{N}_0$ and unknown links $\mathcal{E}_0$ that constitute the domain
of interest whose structural and topological properties we
are interested in. Since often $\mathcal{N}_0$ is a manifold like the Earth's
surface, the brain volume, or a phase space, we will call the elements
of $\mathcal{N}_0$ points here although the following consideration
also applies to discrete sets $\mathcal{N}_0$ like that of all households in
a society. If we can interpret the node set $\mathcal{N}$ to be a sample
from the population $\mathcal{N}_0$ of points resulting from some sampling
procedure, then we might adopt a classical statistical approach
and consider any measurements in the sample network
$G$ as estimates of certain statistics of the population network
$G_0$ that we are truly interested in. E. g., the classical measure
of degree of a node $v \in \mathcal{N}$,

$$k_v = k_v(G) = |\mathcal{N}_v| = \sum_{i \in \mathcal{N}} a_{iv}, \tag{6}$$

could be interpreted as the simplest estimator of the number (or
proportion) $k_0(v)$ of points in $\mathcal{N}_0$ to which $v$ is linked in $G_0$.
If, however, the sampling procedure is such that certain points
$v \in \mathcal{N}_0$ are selected for the sample with a higher individual
sampling probability $p_v$ than others, basic statistics tells us that
a much better estimator of $k_0(v)$ is a weighted sum,

$$\hat{k}_v = \sum_{i \in \mathcal{N}_v} w_i = \sum_{i \in \mathcal{N}} w_i a_{iv}, \tag{7}$$

with suitable node weights $w_v \geq 0$. As in the well-known
Horvitz-Thompson estimator of a sample mean, the optimal
weights $w_v$ are given by inverse probability weighting, i. e.,
they are inversely proportional to the sampling probabilities,
$w_v \propto 1/p_v$. E. g., if $G$ is a climate network constructed from
meteorological stations and the probability $p_v$ of having a station
in location $v$ is proportional to the local human population
density, then any analysis of $G$ should assign a node at $v$ a
weight proportional to the inverse human population density, to
make sure that the climates in sparsely and densely populated
areas are equally represented. In some cases, statistical considerations
can also motivate more sophisticated choices of the
weights $w_v$, like the reliability-adjusted Kriging weights used
for meteorological station data in [36, Eq. 15].
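
The effect of inverse probability weighting on the degree estimate can be seen in a small, fully deterministic sketch. It is an illustration only and makes several assumptions: the domain of interest is a hypothetical set of 100 points, one half of which is sampled completely and the other half systematically at every fifth point (standing in for a sampling probability of 0.2), and the links of the node under study in the full network are simply postulated.

```python
# Hypothetical population of 100 points; the first half is sampled completely,
# the second half only at every fifth point.
population = range(100)

def sampling_probability(x):
    return 1.0 if x < 50 else 0.2

sample = [x for x in population if x < 50 or x % 5 == 0]

# Node under study and its (postulated) neighbours in the full network G_0.
v = 45
linked_in_G0 = {x for x in range(40, 60) if x != v}
k0 = len(linked_in_G0)  # quantity we would like to estimate

# Neighbours of v that are actually present in the sampled network G.
sampled_neighbours = [x for x in sample if x in linked_in_G0]

# Classical degree: simply counts the sampled neighbours.
k_unweighted = len(sampled_neighbours)

# Weighted degree with inverse probability weights w = 1 / p.
k_weighted = sum(1.0 / sampling_probability(x) for x in sampled_neighbours)

print(k0, k_unweighted, k_weighted)
```

The unweighted count underestimates the number of linked points because the sparsely sampled half contributes too few nodes, while the weighted sum compensates for this by counting each of those nodes five times.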

If we were to follow this statistical approach more thoroughly,
we would try to identify each property of the domain
of interest we are interested in with some statistic $f_0$ of $G_0$, and
then use a suitably weighted estimator $f$ of $f_0$ that is at least
statistically consistent (i. e., converges to $f_0$ in a certain sense as $N$
increases), hopefully also unbiased (i. e., has an average error
of zero) and efficient (i. e., has small variance), and maybe even
robust (i. e., is not very sensitive to only local changes in the
network). Verifying these properties for a large number of different
network measures is, however, a research program requiring
much analytical effort beyond the scope of a single paper.
Moreover, it would likely require some complicated continuity
assumptions on $G_0$ that would restrict the applicability of those
measures considerably, which is why we pursue a simpler approach
here to find suitable weighting schemes for individual
network measures.

Note that in principle, the above estimator $\hat{k}_v$ can be interpreted
as a special case of the strength $s_v = \sum_{i \in \mathcal{N}_v} w_{vi}$ of a node
$v$ in a directed, (link-)weighted network in which we simply use
the node weight $w_i$ of the target node as the link weight $w_{vi}$ of
the directed link. For other network measures, this interpretation
is, however, not helpful since the measure might have no
counterpart for directed weighted networks, or that counterpart
is unsuitable for our problem (as in case of the clustering coefficient,
see below).

4.2 Numerical approximation 

If the domain of interest $G_0$ provides some notion of (geometric)
distance (like any spatially embedded network does), an
alternative approach is to consider each node $v \in \mathcal{N}$ as representative
for a small cell of points in $v$'s geometrical vicinity
in $\mathcal{N}_0$, whose size (in terms of some suitable measure, e. g.,
Lebesgue measure) we denote by $w_v$. By geometrical vicinity
we mean those points of the underlying domain of interest
that have a small geometrical distance from $v$, as opposed to
its neighbourhood $\mathcal{N}_v^+$ that consists of those nodes in the network
with a network-theoretic distance $\leq 1$. This interpretation
would be adequate, e. g., if $\mathcal{N}_0$ is a continuous manifold and $\mathcal{N}$
a subset of points on a grid or derived by some meshing procedure
(e. g., adaptive mesh refinement, [37]).

If, because of the continuity properties of the underlying
system, it can be expected that all points $v'$ in the cell
represented by $v$ are linked
to more or less the same nodes in $G_0$ as $v$ is, then a natural
approximation for an interesting statistic $f_0$ of $G_0$ would use
the aggregation weight $w_v$ wherever the formula for $f_0$ involves
the node $v$.

E. g., for the above measure of degree, we could again
use $\hat{k}_v = \sum_{i \in \mathcal{N}_v} w_i$ instead of $k_v = |\mathcal{N}_v|$ to approximate $k_0(v)$,
since each node $i$ in $v$'s $G$-neighbourhood represents $w_i$ "many"
nodes in $G_0$ of which most can be expected to be linked to $v$ as
well (Fig. 5 illustrates this idea). In the context of climate networks,
each node $i$ represents a portion of the Earth's surface
of relative size $w_i = \cos(\text{latitude of } i)$, and $\hat{k}_v$ is known as area
weighted connectivity [26].

In many cases of continuous domains of interest $G_0$, a point $v \in \mathcal{N}_0$ is usually also linked to all or at least most of the points in its geometrical vicinity. More formally, in many cases the following local connectedness condition will hold for some suitable distance function $d$: for all $v \in \mathcal{N}_0$ there is $\varepsilon > 0$ such that each point $i \in \mathcal{N}_0 \setminus \{v\}$ with $d(i, v) < \varepsilon$ is linked to $v$. E. g., if $\mathcal{N}_0$ is the Earth's surface, and $i, j \in \mathcal{N}_0$ are linked in $G_0$ iff the surface temperature time series for $i$ and $j$ exhibit a product-moment correlation coefficient larger than some threshold value, then the smoothness of the underlying physics will imply the above.

Fig. 5. (Colour online) A set $\mathcal{N}$ of nodes (circles) representing cells of different size of the domain of interest (dashed triangles). In many applications, a node $v$ will be linked to one or more smoothly bounded regions of the domain of interest (red surrounded regions), containing $v$ itself and some other nodes (filled circles; arrows show the resulting links in the network $G$). The red surrounded area of size $k_0(v)$ can then be approximated by the grey shaded area of size $\hat{k}_v$, or more accurately by the grey shaded area plus $v$'s own cell area, giving the estimate $k^*_v$. The classical degree $k_v$ is just the number of filled circles.

In such a network, an alternative approximation to $k_0(v)$ would then be

$$k^*_v = \hat{k}_v + w_v = \sum_{i \in \mathcal{N}_v^+} w_i = \sum_{i \in \mathcal{N}} w_i a^+_{vi}, \tag{8}$$

which can be expected to be a better approximation if the mesh is fine enough. This estimator can also be interpreted as a classical numerical approximation to the integral of the indicator function of $v$'s unpunctured neighbourhood in $G_0$ (see Fig. 5 again).
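The three degree variants discussed so far can be compared in a few lines. The following is a minimal sketch, assuming a symmetric 0/1 adjacency matrix with zero diagonal; the function name `degree_variants` and the three-node toy network with latitudes 0, 60 and 80 degrees are hypothetical and only serve as illustration.

```python
import numpy as np

def degree_variants(A, w):
    """Return (k, k_hat, k_star) for a symmetric 0/1 adjacency matrix A
    with zero diagonal and positive node weights w."""
    A = np.asarray(A, dtype=float)
    w = np.asarray(w, dtype=float)
    k = A.sum(axis=1)        # classical degree: number of neighbours
    k_hat = A @ w            # summed weights of the neighbours only
    k_star = k_hat + w       # Eq. (8): also count v's own cell
    return k, k_hat, k_star

# Hypothetical toy example: a path 0 - 1 - 2 of grid points at
# latitudes 0, 60 and 80 degrees, weighted by cos(latitude).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
w = np.cos(np.radians([0.0, 60.0, 80.0]))
k, k_hat, k_star = degree_variants(A, w)
```

Here `k_star` equals `(A + I) @ w`, which is the matrix form of the last expression in Eq. (8).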

However, for many of the more complex network measures we will study below, e. g., random walk based measures or spectra, our weighted versions cannot be interpreted as approximations to integrals that easily, and a thorough analysis of their approximation qualities would require much technical effort. This is why we rather pursue a third, more pragmatic approach, which is motivated by a simple property that both the statistical and the numerical approximation interpretations have in common.

4.3 Pragmatic axiomatic approach 

In both the statistical and the approximation approach, it is clear that the estimation or approximation should usually become better when the resolution of $G$ as a description of the domain of interest $G_0$ is increased by replacing some or all nodes by a larger set of nodes representing smaller parts of $G_0$. Such refinements would usually change the corresponding inverse sampling densities or cell sizes that we use as our aggregation weights $w_v$. Let us now consider the case of sufficiently high resolution, i. e., where the sample is dense enough or the cell sizes are small enough to resolve all structural features of $G_0$ that are considered relevant, so that we do not expect there to be considerable inhomogeneities inside the region of $G_0$ represented by each individual node. Now imagine that an elementary refinement of $\mathcal{N}$ was performed in which only one old node $s \in \mathcal{N}$ was replaced by two new nodes $s' \neq s'' \in \mathcal{N}_0$ which together represent more or less the same subset of $\mathcal{N}_0$ as $s$ did. Then this would leave the aggregation weights $w_i$ of the other nodes $i \in \mathcal{N} \setminus \{s\}$ mostly unchanged, whereas the weights $w_{s'}$ and $w_{s''}$ of the new nodes would approximately sum up to the former $w_s$. Also, because the resolution was assumed to be sufficient already, $s'$ and $s''$ would be linked to more or less the same nodes as $s$ was, and most likely also to each other. In that case, a good estimate or approximation $f$ to some statistic $f_0$ of $G_0$ should probably become somewhat more precise, but should certainly not be changed much by such an elementary refinement. This intuitive reasoning can be turned into a simple axiomatic guiding requirement when we idealize the above situation as follows.

Fig. 6. The operation of node splitting replaces a node $s$ of weight $w_s$ in network $G$ with two linked nodes $s', s''$ of weights $w_{s'} + w_{s''} = w_s$ which get the same neighbourhood as $s$ had, giving network $G'$. The inverse operation of twin merging transforms $G'$ back into $G$.

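Let $G = (\mathcal{N}, \mathcal{E})$ be a simple graph with weights $w_i > 0$ for all $i \in \mathcal{N}$, and let $s \in \mathcal{N}$ be some "old" node, $s' \neq s'' \notin \mathcal{N}$ two "new" nodes, $w_{s'}, w_{s''} > 0$ their weights, and $w_{s'} + w_{s''} = w_s$. Then the "refined" graph $G' = (\mathcal{N}', \mathcal{E}')$ with

$$\mathcal{N}' = \mathcal{N} \setminus \{s\} \cup \{s', s''\}$$

$$\text{and}\quad \mathcal{E}' = \{\{i, j\} \in \mathcal{E} : i, j \neq s\} \cup \{\{i, s'\}, \{i, s''\} : \{i, s\} \in \mathcal{E}\} \cup \{\{s', s''\}\} \tag{9}$$

and with weights $w_i$ for all $i \in \mathcal{N}'$ will be called a node splitting refinement of $G$. That is, $G'$ is derived from $G$ by "splitting" the node $s$ into two new interlinked nodes $s', s''$ with the same total weight as $s$ and linking those two to exactly those nodes to which $s$ was linked (see Fig. 6).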
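The refinement just defined translates directly into an operation on an adjacency matrix and a weight vector. The sketch below is one possible encoding, not the authors' implementation: the function name `split_node`, the choice to reuse index `s` for $s'$ and to append $s''$ as the last node, and the parameter `w_first` (the weight given to $s'$) are all assumptions made for illustration.

```python
import numpy as np

def split_node(A, w, s, w_first):
    """Node splitting refinement: node s keeps index s and becomes s'
    with weight w_first; a new last node s'' gets weight w[s] - w_first."""
    A = np.asarray(A)
    w = np.asarray(w, dtype=float)
    if not 0.0 < w_first < w[s]:
        raise ValueError("need 0 < w_first < w[s]")
    n = len(w)
    A2 = np.zeros((n + 1, n + 1), dtype=A.dtype)
    A2[:n, :n] = A
    A2[n, :n] = A[s]           # s'' is linked to exactly the neighbours of s
    A2[:n, n] = A[:, s]
    A2[s, n] = A2[n, s] = 1    # s' and s'' are linked to each other
    w2 = np.append(w, w[s] - w_first)
    w2[s] = w_first            # both parts sum to the old weight of s
    return A2, w2

# Usage: split the middle node of the path 0 - 1 - 2.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
w = np.array([1.0, 2.0, 3.0])
A2, w2 = split_node(A, w, s=1, w_first=0.5)
```

Summing the weights over unpunctured neighbourhoods before and after the call gives the same value for every old node, which is the invariance property required of the n. s. i. degree.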

Now a (global) network measure $f$ will be called node splitting invariant (n. s. i.) iff

$$f(G') = f(G) \tag{10}$$

for all pairs of networks $G, G'$ and all weight functions $w$ for which $G'$ is a node splitting refinement of $G$. Likewise, if $f_i$ is a network measure which is defined for nodes then $f_i$ will be called node splitting invariant iff

$$f_i(G') = f_i(G) \quad\text{and}\quad f_{s'}(G') = f_{s''}(G') = f_s(G) \tag{11}$$

for all such $G, G', w$ and all $i \in \mathcal{N} \setminus \{s, s', s''\}$. Similar definitions are possible for measures with more than one argument but are not needed here. In other words, a n. s. i. measure is unaffected by any node splitting refinements. This is, however, not
to say that we actually perform any refinements or node split- 
tings when applying a node splitting invariant measure. Rather, 
the notion of node splitting must be thought of as only a hy- 
pothetical, idealized operation that could be applied in princi- 
ple, and the consistency requirement of node splitting invari- 
ance is only to make sure that the node weights are used in a 
proper way that is suitable for links which represent some kind 
of "similarity". In other words, to define a n. s. i. measure only 
requires us to analyse what would happen if nodes were split 
in the suggested way. To apply a n. s. i. measure to a given net- 
work, one can just use the corresponding formula as one would 
with an unweighted measure, without any need to think about 
node splitting. 

In the context of coarse-graining, node splitting invariance can also be called twin merging invariance, requiring that $f$ should not change whenever two twins $s', s''$ with weights $w_{s'}, w_{s''}$ are merged into one node $s$ with weight $w_s = w_{s'} + w_{s''}$, where two linked nodes $s', s''$ are called twins iff they have the same neighbourhood, $\mathcal{N}^+_{s'} = \mathcal{N}^+_{s''}$. While node splitting corresponds to an idealized form of refinement, twin merging is an idealized form of coarse-graining and is the inverse operation of node splitting, hence both notions of invariance are equivalent.
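Twin merging can be sketched in the same matrix representation. The helper names `find_twins` and `merge_twins` below are hypothetical; the code assumes a symmetric 0/1 adjacency matrix with zero diagonal and compares rows of $A^+ = A + I$ to test whether two nodes share the same unpunctured neighbourhood.

```python
import numpy as np

def find_twins(A):
    """List all pairs (i, j), i < j, of linked nodes with equal
    unpunctured neighbourhoods N_i^+ = N_j^+."""
    A_plus = np.asarray(A) + np.eye(len(A), dtype=int)
    n = len(A_plus)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if A_plus[i, j] and np.array_equal(A_plus[i], A_plus[j])]

def merge_twins(A, w, i, j):
    """Merge twin j into twin i: drop node j and add its weight to i."""
    keep = [v for v in range(len(w)) if v != j]
    w2 = np.asarray(w, dtype=float).copy()
    w2[i] += w2[j]
    return np.asarray(A)[np.ix_(keep, keep)], w2[keep]

# Usage: a triangle 0, 1, 2 with a pendant node 3 attached to node 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
w = np.array([1.0, 2.0, 3.0, 4.0])
twins = find_twins(A)                    # nodes 0 and 1 are twins
A_m, w_m = merge_twins(A, w, *twins[0])
```

Note that remaining nodes are renumbered after the merge because one row and column is removed.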

While the above definitions of node splitting and twin 
merging are suitable for networks in which links represent a 
kind of similarity or direct connection, for which it is natural 
to require that twins are linked and to assume that the parts 
of a split node are linked, other types of networks might call 
for a different definition of node splitting and twin merging. If, 
e.g., links represent some kind of "complementarity" instead 
of similarity, a natural kind of splitting would leave s' and s" 
unlinked, and the corresponding definition of "twin" would re- 
quire that twins are not linked. Such a variant would lead to 
weighted network measures similar but not identical to ours, 
but we do not pursue this in the present paper. 

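An example of a n. s. i. measure is the above-defined n. s. i. degree $k^*_v = \sum_{i \in \mathcal{N}_v^+} w_i$, and that it is indeed n. s. i. is easily seen from the fact that

$$\mathcal{N}^+_{s'}(G') = \mathcal{N}^+_{s''}(G') = \mathcal{N}^+_s(G) \setminus \{s\} \cup \{s', s''\},$$

$$\mathcal{N}^+_i(G') = \mathcal{N}^+_i(G) \setminus \{s\} \cup \{s', s''\}, \tag{12}$$

$$\text{and}\quad \mathcal{N}^+_j(G') = \mathcal{N}^+_j(G)$$

for all $i \in \mathcal{N}_s(G)$ and $j \in \mathcal{N} \setminus \mathcal{N}^+_s(G)$.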

Note that the definition of node splitting invariance (also known as twin merging invariance) does not at all rely on the formal specification of an underlying domain of interest $G_0$, but it depends on the network $G$ and the weights $w_i$ alone, which makes this tool much easier to use than estimation theory or approximation theory. Nevertheless, a conjecture and working hypothesis of this paper is that n. s. i. measures are the natural candidates for good estimation or approximation of the corresponding properties of a potentially underlying domain of interest, and that they will usually prove to be statistically consistent and exhibit good convergence properties when the domain of interest and the sampling or meshing procedures fulfil some suitable continuity or measurability properties and when the aggregation weights $w_i$ are chosen accordingly.

In the following, we will therefore present n. s. i. versions $f^*$ of a number of local and global network measures $f$ that can be found in the literature, and we will refer to the possibly underlying domain of interest only when motivating some interpretations of these measures, but without formally defining the statistic $f_0$ of $G_0$ which is supposed to be estimated or approximated by $f^*$.

The basic construction mechanisms we will use are 

(i) to sum up aggregation weights wherever the original mea- 
sure counts nodes, 

(ii) to use unpunctured neighbourhoods $\mathcal{N}_v^+$ wherever the original measure uses punctured neighbourhoods $\mathcal{N}_v$ (in other words, to consider $v$ as linked to itself),

(iii) to also allow for equality of $i, j$ wherever the original measure involves a sum over distinct nodes $i, j$, and

(iv) to "plug in" a n. s. i. version of a measure $g$ wherever this $g$ is used in the definition of another measure $f$.

Both mechanisms (i) and (ii) were used in the definition of $k^*_v$ above, and an example for mechanisms (iii) and (iv) will be given in the following section when we will consider the clustering coefficient.



5 Local measures 

A network measure $f_v = f_v(G)$ that is defined for each node $v \in \mathcal{N}$ will be called local here. (Note that we always understand
the terms "neighbour" and "local" as referring to the network 
topology, not to some possibly underlying geometry. So, local 
neighbours might be geometrically far apart.) 



5.1 Degree 

We already treated the degree measure $k_v = |\mathcal{N}_v|$ and defined the n. s. i. degree of $v$ as

$$k^*_v := \sum_{i \in \mathcal{N}_v^+} w_i. \tag{13}$$

Let us compare $k_v$ and $k^*_v$ in two example networks.

In our human brain example network, the nodes with the highest degree $k_v$ are the right lingual gyrus and the left precuneus region (LING.R and PCUN.L, see [7] for these abbreviations), connected to 32 and 31 of the other 89 nodes, i. e., to about one third of the nodes. The volume of the 90 individual ROIs varies by a factor of 23 (Fig. 1). If it is used as a node weight for n. s. i. degree, the right lingual gyrus and the left precuneus again have the largest values, but these are now $k^*_v \approx 0.46W$, showing that in reality, both regions are functionally connected not to one third but rather to almost half the entire brain (in terms of volume). The third most connected node in terms of $k^*_v$ (volume) is the left middle frontal gyrus (which seems consistent with other measures of node importance as reported in [7]), but in terms of $k_v$ (nodes) it is the left calcarine cortex (CAL.L). Since the number of linked nodes basically depends on the level of detail in the used parcellation of the brain, $k_v$ can change considerably for different parcellations even if the ROI represented by node $v$ remains unchanged, and $k_v$ seems to be influenced much more than $k^*_v$ by the choice of parcellation.

Fig. 7. (Colour online) Log-log plot of the complementary cumulative distribution function of degree $k_v$ (thin black line), node weight $w_v$ (dashed line), and n. s. i. degree $k^*_v$ (thick blue line) in the routing network of autonomous systems in the internet. Power laws would appear linear.

Fig. 8. Clustering coefficient $C_v$ vs. n. s. i. clustering coefficient $C^*_v$ of those nodes in the functional human brain network that have at least two neighbours, showing considerable differences (Pearson correlation is $\rho = 0.65$). Disk area is proportional to node weight (ROI volume).

In the internet example network, the earlier findings of power laws for the degree distribution reported in [12, 13, 15] are supported by the apparent linear relationship in the log-log plot in Fig. 7 (thin black line). In other words, the distribution of the number $k_v$ of linked AS's of a given AS $v$ seems to follow a power law. However, the distribution of the size $w_v$ of a given AS $v$ in terms of CIDR prefixes also seems to follow a power law, as can be seen in Fig. 7 (dashed line), and the linear correlation coefficient between $\ln k_v$ and $\ln w_v$ is high ($\approx 0.5$), so the power law for $k_v$ might be a consequence of the one for $w_v$. If we ask for the share of the internet (instead of the number of AS's) a given AS $v$ is linked to, $k^*_v/W$ seems a more accurate estimate of this than $k_v/N$, and the findings are different: The AS with highest $k_v$ also has the highest $k^*_v$, but while it is linked to only 7.9% of all AS's (since $k_v = 0.079N$), it seems to be linked to approx. one fourth of all IP addresses (since $k^*_v = 0.24W$). When plotting the probability that a randomly chosen CIDR prefix belongs to an AS that is linked to other AS's with more than $x$ total CIDR prefixes, this no longer shows a clear power law behaviour (Fig. 7, thick blue line). Also, $\ln k^*_v$ is less strongly correlated ($\approx 0.3$) to $\ln w_v$ than $\ln k_v$ is, and less strongly correlated ($\approx 0.4$) to $\ln k_v$ than $\ln w_v$ is.



5.2 Clustering coefficient 

The local clustering coefficient of $v$,

$$C_v = \frac{\sum_{i \neq j \in \mathcal{N}_v} a_{ij}}{k_v (k_v - 1)} \in [0, 1], \tag{14}$$

is the probability that two nodes drawn at random from those linked to $v$ are linked with each other. We get a weighted version by employing all four mechanisms (i)-(iv), giving the n. s. i. local clustering coefficient

$$C^*_v = \frac{\sum_{i, j \in \mathcal{N}_v^+} w_i a^+_{ij} w_j}{k^{*2}_v} \in \left[ \frac{w_v (2 k^*_v - w_v)}{k^{*2}_v}, 1 \right] \subseteq [0, 1], \tag{15}$$

which estimates the probability that two weight units (or points in terms of $G_0$) drawn at random from the part of the network linked to $v$ are linked with each other. Because we use $A^+$, $C^*_v$ tends to be larger than $C_v$ if $k_v$ is small, and it is defined for all nodes while $C_v$ only makes sense when $k_v > 1$. If the weights vary considerably, $C^*_v$ and $C_v$ can rank the nodes quite differently, although neither needs to be significantly linearly correlated with $w_v$. $C^*_v$ and $C_v$ can also differ considerably when the weights inside $\mathcal{N}_v$ vary strongly.

All these effects can be seen nicely in our brain example network (Fig. 8). The right thalamus region (THA.R), e. g., is in the top half (rank 37) according to its $C_v$ of 0.54, but almost at the bottom (rank 86 out of 90) according to its $C^*_v$ of 0.53, although the absolute values are almost equal. This is because among its 21 neighbours, it is rather the smaller ones (like THA.L and CAU.R) that are linked with many other neighbours.

One might think that, similar to the case of degree, a directed version of the formula for the local clustering coefficient from the theory of (link-)weighted networks [25], which is $c^w_v = \sum_{i, j \in \mathcal{N}_v} a_{ij} (w_{vi} + w_{vj}) / 2 s_v (k_v - 1)$, could also be a good candidate if the directed link-weights are defined as $w_{vi} = w_i$, giving $c^w_v = \sum_{i \in \mathcal{N}_v} \sum_{j \in \mathcal{N}_v} a_{ij} (w_i + w_j) / 2 s_v (k_v - 1)$. But it is easy to see that this does not behave well under node splitting or twin merging since a linked pair $i - j$ contributes $w_i + w_j$ instead of $w_i w_j$. If, e. g., $\mathcal{N}_v = \{i, j, s', s''\}$ with $w_i = w_j = w_{s'} = w_{s''} = 1$, $a_{ij} = 0$, and $s', s''$ are twins linked to $i$ but not to $j$, then $c^w_v = C_v = 3/6$, but after merging the twins $s', s''$ into one new node $s$ with $w_s = 2$, one has $k_v = 3$, $s_v = 4$, and $c^w_v = (w_i + w_s) / 2 s_v (k_v - 1) = 3/16$, much smaller than before. In contrast, $C^*_v = 19/25$ before and after the twin merging.
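The value $C^*_v = 19/25$ in this example can be checked numerically. The sketch below reads the n. s. i. clustering coefficient as the weighted sum over all pairs in the unpunctured neighbourhood divided by $k^{*2}_v$, and it assumes $w_v = 1$ for the central node, which the example does not state explicitly. The function names and the node ordering are hypothetical.

```python
import numpy as np

def clustering(A, v):
    """Classical local clustering coefficient C_v (needs k_v > 1)."""
    nb = np.flatnonzero(A[v])
    k = len(nb)
    return A[np.ix_(nb, nb)].sum() / (k * (k - 1))

def nsi_clustering(A, w, v):
    """n.s.i. local clustering coefficient C*_v."""
    A_plus = A + np.eye(len(A), dtype=A.dtype)
    nb = np.flatnonzero(A_plus[v])       # unpunctured neighbourhood of v
    k_star = w[nb].sum()
    return w[nb] @ A_plus[np.ix_(nb, nb)] @ w[nb] / k_star**2

# Before merging; node order: v, i, j, s', s'' (w_v = 1 is an assumption).
A = np.array([[0, 1, 1, 1, 1],
              [1, 0, 0, 1, 1],
              [1, 0, 0, 0, 0],
              [1, 1, 0, 0, 1],
              [1, 1, 0, 1, 0]])
w = np.ones(5)

# After merging the twins; node order: v, i, j, s with w_s = 2.
A_m = np.array([[0, 1, 1, 1],
                [1, 0, 0, 1],
                [1, 0, 0, 0],
                [1, 1, 0, 0]])
w_m = np.array([1.0, 1.0, 1.0, 2.0])

c_before = nsi_clustering(A, w, 0)
c_after = nsi_clustering(A_m, w_m, 0)
```

The unweighted coefficient, by contrast, drops from `clustering(A, 0)` to `clustering(A_m, 0)` when the twins are merged.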



5.3 Measures of centrality and betweenness 

Many measures try to assess several aspects of "node importance". Based on the distances of one node $v$ to all others, we consider three variants of closeness centrality,

$$CC_v = 1/\langle d_{vi} \rangle_i, \qquad CC'_v = \langle 2^{-d_{vi}} \rangle_i, \qquad CC''_v = \langle 1/d_{vi} \rangle_{i \neq v}, \tag{16}$$

the latter also being called the efficiency of $v$, where $d_{vi}$ is the number of links on a shortest path from $v$ to $i$, or, if there is no such path, either $\infty$ or $N$, depending on the convention chosen [3, 38, 40]. A weighted version of $CC_v$ should give us the inverse average distance of $v$ from other weight units or points rather than from other nodes. But for this to become n. s. i., one has to interpret (somewhat peculiarly) each node to have unit (instead of zero) distance to itself. This is because after an imagined split $s \to s', s''$, the two parts $s', s''$ of $s$ have unit, not zero, distance. The n. s. i. distance function is hence given by

$$d^*_{vv} = 1 \quad\text{and}\quad d^*_{vi} = d_{vi} \text{ for } i \neq v, \tag{17}$$

i. e., the zeros on the diagonal of the ordinary distance matrix are replaced by ones to get the n. s. i. distance matrix, without changing any off-diagonal entries. Using this, we can derive $v$'s n. s. i. closeness centrality measures as

$$CC^*_v = \frac{1}{\langle d^*_{vi} \rangle^w_i} = \frac{W}{\sum_{i \in \mathcal{N}} w_i d^*_{vi}} = \frac{W}{w_v + \sum_{i \neq v} w_i d_{vi}}, \tag{18}$$

$CC'^*_v = \langle 2^{-d^*_{vi}} \rangle^w_i$, and $CC''^*_v = \langle 1/d^*_{vi} \rangle^w_i$. All take values in $[0, 1]$.
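The n. s. i. closeness measures only require hop distances and a weighted average. The following sketch reads the weighted average over $i$ as $\sum_i w_i x_i / W$ and uses the $\infty$ convention for unreachable nodes; the function names are hypothetical and the breadth-first search is a plain textbook version.

```python
from collections import deque
import numpy as np

def bfs_distances(A, v):
    """Hop distances from v (np.inf for unreachable nodes)."""
    d = np.full(len(A), np.inf)
    d[v] = 0
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for x in np.flatnonzero(A[u]):
            if d[x] == np.inf:
                d[x] = d[u] + 1
                queue.append(x)
    return d

def nsi_closeness(A, w, v):
    """Return the three n.s.i. closeness variants of node v."""
    d_star = bfs_distances(A, v)
    d_star[v] = 1.0                          # Eq. (17): unit self-distance
    W = w.sum()
    cc = W / (w @ d_star)                    # Eq. (18)
    cc_exp = (w @ 2.0 ** (-d_star)) / W      # exponential variant
    cc_harm = (w @ (1.0 / d_star)) / W       # efficiency variant
    return cc, cc_exp, cc_harm

# Usage: path 0 - 1 - 2 where node 2 carries twice the weight.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
w = np.array([1.0, 1.0, 2.0])
cc, cc_exp, cc_harm = nsi_closeness(A, w, 0)
```

Splitting node 2 into two linked unit-weight twins leaves all three values for node 0 unchanged, as required.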

In the internet example network, $CC_v$ and $CC^*_v$ are generally quite small, vary only little, and are highly correlated. Still, both measures lead to different rankings of the most central nodes (Fig. 9).

Also in our Wikipedia example network, $CC_v$ and $CC^*_v$ were not very discriminatory, but $CC'_v$ and $CC'^*_v$ were. Fig. 10 depicts their relationship for the most central nodes, again showing considerable differences in ranks. According to $CC'_v$, the top ten central articles in decreasing order are those named "Physics", "Mathematics", "Chemistry", "Germany", "Science", "Physicist", "Quantum mechanics", "Albert Einstein", "Astronomy", and "Engineering", while according to $CC'^*_v$ the list is "Physics", "Mathematics", "Germany", "Chemistry", "Science", "Japan", "Italy", "Albert Einstein", "Russia", and "Astronomy". "Physicist" for instance has the sixth highest value of $CC'_v$ but only the 21st largest value of $CC'^*_v$ because it is linked to a very large number (7.6% of the nodes) of comparatively short articles (accounting for 5.5% of the total text) on individual scientists, so that $CC'_v$ treats them as a larger part of the network than $CC'^*_v$ does.

Fig. 9. Closeness $CC_v$ vs. n. s. i. closeness $CC^*_v$ of those nodes in the internet network (see text) with the highest values. Disk area is proportional to node weight (no. of CIDR prefixes).

Fig. 10. Exponential closeness $CC'_v$ vs. its n. s. i. version $CC'^*_v$ of those Wikipedia articles on physics (see text) with the highest values. Disk area is proportional to node weight (article size in characters).

Other, somewhat more sophisticated importance measures depend on the paths between all other nodes that lead through a given node $v$. The (shortest path) betweenness of $v$ is the proportion of shortest paths between randomly chosen nodes $a, b$ that lead through $v$:

$$BC_v = \langle n_{ab}(v)/n_{ab} \rangle_{ab} \in [0, 1], \tag{19}$$



where $n_{ab}$ is the total number of shortest paths from $a$ to $b$, and $n_{ab}(v)$ is the number of those paths that pass through $v$ as an inner node. Formally, $n_{ab}$ can be written as a sum over all node tuples $(t_0, \dots, t_{d_{ab}})$ with $t_0 = a$ and $t_{d_{ab}} = b$, where the summands are either zero or one, depending on whether each $t_\ell$ is linked to its successor $t_{\ell+1}$, for $\ell = 0, \dots, d_{ab} - 1$. As the latter condition can be written as a product of elements of the adjacency matrix, we have:

$$n_{ab} = \sum_{\substack{(t_0, \dots, t_{d_{ab}}) \in \mathcal{N}^{d_{ab}+1} \\ t_0 = a,\ t_{d_{ab}} = b}} \ \prod_{\ell=1}^{d_{ab}} a_{t_{\ell-1} t_\ell}. \tag{20}$$

A similar formula holds for $n_{ab}(v)$, only that for some $m$ in $1 \dots d_{ab} - 1$, $t_m$ must equal $v$:

$$n_{ab}(v) = \sum_{m=1}^{d_{ab}-1} \ \sum_{\substack{(t_0, \dots, t_{d_{ab}}) \in \mathcal{N}^{d_{ab}+1} \\ t_0 = a,\ t_m = v,\ t_{d_{ab}} = b}} \ \prod_{\ell=1}^{d_{ab}} a_{t_{\ell-1} t_\ell}. \tag{21}$$

Fig. 11. Interpretation of n. s. i. shortest path betweenness as the probability density that a randomly chosen shortest path (grey curves) between two randomly chosen points (grey dots) in the domain of interest $G_0$ passes through a randomly chosen point (black dot) in the area (dashed region) represented by $v$ (black circle).

When $s$ is hypothetically split into $s' + s''$, any shortest path through $s$ becomes a pair of shortest paths, one through $s'$ and the other through $s''$. Also, a shortest path from $s''$ to some $b \neq s'$ will never meet $s'$. Thus, to make $BC_v$ n. s. i., it suffices to make $n_{ab}$ and $n_{ab}(v)$ n. s. i. by making each path's contribution proportional to the weight of each inner node (that is, to the product of these weights!), where in case of $n_{ab}(v)$ we have to skip $w_v$ in this product:

$$n^*_{ab} = \sum_{\text{(as above)}} a_{t_0 t_1} \prod_{\ell=2}^{d_{ab}} \left( w_{t_{\ell-1}} a_{t_{\ell-1} t_\ell} \right),$$

$$n^*_{ab}(v) = \sum_{m=1}^{d_{ab}-1} \ \sum_{\text{(as above)}} \frac{1}{w_v}\, a_{t_0 t_1} \prod_{\ell=2}^{d_{ab}} \left( w_{t_{\ell-1}} a_{t_{\ell-1} t_\ell} \right). \tag{22}$$

The n. s. i. shortest path betweenness

$$BC^*_v = \langle n^*_{ab}(v)/n^*_{ab} \rangle^w_{ab} \in [0, 1/w_v] \tag{23}$$

can then be interpreted as an estimate of the probability (or probability density) that a randomly chosen shortest path between two randomly chosen points in the underlying domain of interest $G_0$ passes through a specific randomly chosen point in the area $\mathcal{R}_v$ represented by $v$, as illustrated in Fig. 11. The product $w_v BC^*_v$ then estimates the probability that such a path passes through any point in $\mathcal{R}_v$, which is not n. s. i. but is additive under node splitting: $w_{s'} BC^*_{s'} + w_{s''} BC^*_{s''} = w_s BC^*_s$.

Newman [41] gives an $O(|\mathcal{N}| |\mathcal{E}|)$-time algorithm to compute $BC_v$ for all $v$, based on Dijkstra's algorithm, and this can easily be adapted to compute $BC^*_v$ for all $v$ with the same algorithmic time complexity.
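For small networks, the n. s. i. betweenness can be sketched by brute force. The code below is not the $O(|\mathcal{N}||\mathcal{E}|)$ algorithm mentioned above but a simple cubic-time illustration that assumes a connected, unweighted-link graph. It counts shortest paths with each inner node weighted by its node weight, and uses the fact that the weighted count of paths through $v$ factorizes into the count from $a$ to $v$ times the count from $v$ to $b$, which automatically skips $w_v$. All function names are hypothetical.

```python
from collections import deque
import numpy as np

def weighted_path_counts(A, w, a):
    """BFS from a. Returns (d, s): d[x] is the hop distance and s[x] sums,
    over all shortest a-x paths, the product of inner node weights."""
    n = len(A)
    d = np.full(n, -1)
    s = np.zeros(n)
    d[a], s[a] = 0, 1.0
    queue = deque([a])
    while queue:
        u = queue.popleft()
        # extending a path beyond u makes u an inner node, unless u == a
        factor = 1.0 if u == a else w[u]
        for x in np.flatnonzero(A[u]):
            if d[x] < 0:
                d[x] = d[u] + 1
                queue.append(x)
            if d[x] == d[u] + 1:
                s[x] += s[u] * factor
    return d, s

def nsi_betweenness(A, w):
    """n.s.i. shortest path betweenness for all v (connected graph)."""
    n, W = len(A), w.sum()
    res = [weighted_path_counts(A, w, a) for a in range(n)]
    bc = np.zeros(n)
    for a in range(n):
        d_a, s_a = res[a]
        for b in range(n):
            if b == a:
                continue                  # no inner nodes on a zero-length path
            for v in range(n):
                d_v, s_v = res[v]
                if v not in (a, b) and d_a[v] + d_v[b] == d_a[b]:
                    # weighted paths through v, with w_v itself left out
                    bc[v] += w[a] * w[b] * s_a[v] * s_v[b] / s_a[b]
    return bc / W**2

# Usage: path 0 - 1 - 2 with a heavy middle node.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
w = np.array([1.0, 2.0, 1.0])
bc = nsi_betweenness(A, w)
```

Splitting the middle node into two linked unit-weight twins gives both twins the same value that the unsplit node had.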

In our human brain example network, according to $BC_v$ the top ten ROIs in decreasing order are the nodes TPOsup.R, LING.L, LING.R, TPOmid.L, MFG.R, TPOsup.L, MFG.L, STG.L, CAU.R, and ORBinf.R, while according to $BC^*_v$ the list is TPOmid.L, TPOsup.R, LING.R, LING.L, TPOsup.L, PCL.R, PHG.L, THA.R, CAU.R, and STG.L. The left and right middle frontal gyrus (MFG) are missing in the latter list (having ranks 13 and 19) because most shortest paths that lead through them are between relatively small nodes (left and bottom region in Fig. 1), whereas the right thalamus (THA.R) has a high degree but many of its larger neighbours are not linked to each other, leading to a large value of $BC^*_v$. In this network one can also nicely see the effect of network design choices on network statistics. If we slightly modify the parcellation and treat the mid-sized left and right parts of the dorsal cingulate gyrus as one node $s$ instead of two ($s' =$ DCG.L and $s'' =$ DCG.R, having quite similar neighbourhoods), leaving the rest of the network unchanged, then their $BC$-values increase considerably from $BC_{s'} = 0.029$ and $BC_{s''} = 0.036$ to $BC_s = 0.053$, or from 24th and 18th rank to 11th rank. Their $BC^*$-values, however, behave rather nicely in that the new value $BC^*_s = 0.0000144$ (rank 19) is approximately the average of the two old values $BC^*_{s'} = 0.0000114$ (rank 23) and $BC^*_{s''} = 0.0000171$ (rank 17). Had DCG.L/R been exact twins, $BC^*$ would not have changed at all.

More centrality and betweenness measures will be discussed in the next subsection and in Sec. 6.2.

5.4 Measures based on random walks 

Several network measures are based upon the idea of a random walk along the links of a network without isolated nodes, with all neighbours $j$ of a node $i$ having the same transition probability

$$p_{ij} = a_{ij}/k_i. \tag{24}$$

The crucial ideas in constructing n. s. i. versions of such measures are to make $p_{ij}$ proportional to $w_j$ and to allow the walk to stay at node $i$ (which now also allows for the existence of isolated nodes):

$$p^*_{ij} = a^+_{ij} w_j / k^*_i. \tag{25}$$

Such a walk can be thought to approximate a random walk in $G_0$ which moves to each linked point with the same probability. If $G_0$ is a continuous domain, this discrete random walk must not be confused with a continuous Wiener process, however. In particular, its individual steps might bridge long distances in terms of the domain's geometry.

With the above transition probabilities, the probability of visiting or staying inside $\{s', s''\}$ after a split $s \to s' + s''$ is $p^*_{vs'} + p^*_{vs''} = p^*_{vs}$ or $p^*_{s's'} + p^*_{s's''} = p^*_{s''s'} + p^*_{s''s''} = p^*_{ss}$, respectively, which means the walk is not influenced by the split. With the original transition matrix $P = (p_{ij})_{ij}$, the equilibrium distribution and hence the long-time average relative visiting frequencies are given by $p_v = k_v/K$, where $K = N \langle k_v \rangle_v$ is twice the number of links in $G$. With the n. s. i. transition matrix $P^* = (p^*_{ij})_{ij}$, the equilibrium distribution is $p^*_v = k^*_v/K^*$, where $K^* = W \langle k^*_v \rangle^w_v$.
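The n. s. i. transition matrix and its equilibrium can be checked on a small example. In the sketch below the helper names are hypothetical. By detailed balance, the probability of finding the walker at node $v$ comes out as $w_v k^*_v / K^*$, so that $k^*_v/K^*$ reads as a probability per unit of node weight; this reading is an interpretation added here, not a statement taken from the text.

```python
import numpy as np

def nsi_transition_matrix(A, w):
    """P* with entries p*_ij = a+_ij w_j / k*_i, Eq. (25)."""
    A_plus = A + np.eye(len(A))
    k_star = A_plus @ w
    return A_plus * w[None, :] / k_star[:, None]

def nsi_stationary(A, w):
    """Probability of being at each node in equilibrium: w_v k*_v / K*."""
    k_star = (A + np.eye(len(A))) @ w
    K_star = w @ k_star               # equals W times the weighted mean of k*
    return w * k_star / K_star

# Usage: path 0 - 1 - 2 with increasing node weights.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
w = np.array([1.0, 2.0, 3.0])
P_star = nsi_transition_matrix(A, w)
pi = nsi_stationary(A, w)
```

Every row of `P_star` sums to one, and `pi @ P_star` reproduces `pi`.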

The Arenas-type random walk betweenness of $v$, motivated by [42] and based on the idea of searching a target, is the expected number of visits to $v$ on a random walk that starts and ends at some randomly chosen nodes $a, b$:

$$AB_v = \langle AB_{av}(b) \rangle_{ab} \tag{26}$$

with $AB(b) = (AB_{ij}(b))_{ij} = \sum_{t=1}^{\infty} P'(b)^t$. Since the walk is assumed to stop as soon as $b$ is reached, the "transition" matrix is here $P'(b) = (p'_{ij}(b))_{ij}$ with

$$p'_{ij}(b) = (1 - \delta_{bi})\, a_{ij}/k_i. \tag{27}$$
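The infinite sum over powers of the stopped transition matrix has a closed form whenever it converges, since $\sum_{t \ge 1} M^t = (I - M)^{-1} - I$. The sketch below uses this identity for the unweighted measure and assumes a connected graph without isolated nodes. Averaging over all $N^2$ ordered pairs $(a, b)$ is one possible reading of the average over $a, b$; the function name is hypothetical.

```python
import numpy as np

def arenas_betweenness(A):
    """Unweighted Arenas-type betweenness for all v, Eqs. (26)-(27)."""
    A = np.asarray(A, dtype=float)
    n = len(A)
    P = A / A.sum(axis=1, keepdims=True)     # Eq. (24)
    I = np.eye(n)
    ab = np.zeros(n)
    for b in range(n):
        P_b = P.copy()
        P_b[b, :] = 0.0                      # Eq. (27): the walk stops at b
        AB_b = np.linalg.inv(I - P_b) - I    # sum over t >= 1 of P_b^t
        ab += AB_b.sum(axis=0)               # add up all start nodes a
    return ab / n**2

# Usage: two linked nodes.
ab = arenas_betweenness(np.array([[0, 1], [1, 0]]))
```

For larger networks one would solve linear systems instead of forming explicit inverses; the inverse is used here only to keep the sketch short.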



The main problem in attaining node splitting invariance is the stopping condition introduced by the term $-\delta_{bi}$. If the target node $b$ is split into $b' + b''$, and if $b'$ is the new target node, the walk must not continue after reaching $b''$ since otherwise $AB_v$ would increase. Hence the walk must sometimes stop earlier than before, at least when reaching a twin of the target. As exact twins are usually rare in large networks, it seems natural to adopt a somewhat more continuous stopping condition that may stop the walk with some probability as soon as it enters the neighbourhood of the target. Using a suitable n. s. i. similarity measure $\sigma^*(i, j) \in [0, 1]$ that equals zero for unlinked nodes and one for twins, we can then define a n. s. i. Arenas-type random walk betweenness:

$$AB^*_v = \langle AB^*_{av}(b) \rangle^w_{ab} / w_v \tag{28}$$

with $AB^*(b) = (AB^*_{ij}(b))_{ij} = \sum_{t=1}^{\infty} P'^*(b)^t$ and "transition" matrix $P'^*(b) = (p'^*_{ij}(b))_{ij}$, where

$$p'^*_{ij}(b) = (1 - \sigma^*(b, i))\, a^+_{ij} w_j / k^*_i. \tag{29}$$



For the similarity measure we may, e. g., use one out of the increasing sequence

(30)

with $w(M) = \sum_{v \in M} w_v$ for $M \subseteq \mathcal{N}$. In all these versions, the walk may stop with some probability when a more or less twin-like neighbour of $b$ is reached. If nodes can be expected to "know" their neighbours with some probability, this stopping behaviour can also be interpreted as meaning that once a neighbour of the target is reached, the path to the target will more or less likely be known, hence the target can be considered to be found and the random search walk can stop.

In the brain example network, both the unweighted and 
n. s. i. version (using Oyu) of Arenas-type random walk be- 
tweenness give the same set of nodes with top ten be- 
tweenness values, but in quite different order: for ABy it 
is LING.R, PCUN.L, CAL.L, PCUN.R, MFG.L, MTG.R, 
DCG.L, DCG.R, SFGmed.L, and MFG.R, while for AB*, it 
is PCUN.L, LING.R, MFG.L, CAL.L, MTG.R, PCUN.R, 
SFGmed.L, DCG.L, MFG.R, and DCG.R. 

In our world trade example network, when we use as node weights the considerably varying countries' gross domestic product in 2008 (as reported by the IMF), then according to $AB_v$ the countries with the largest betweenness are CHN, USA, DEU, FRA, JPN, IND, RUS, ITA, SGP, and KOR, whereas according to $AB^*_v$ the ranking is much different: USA, CHN, JPN, KOR, PAN, SAU, MYS, PHL, VNM, and GAB, in descending order. The last six are all connected to all three of the heavy-weight nodes USA, CHN, and JPN (or FRA in case of GAB), which explains their high $AB^*$-values. Also in the layout in Fig. 2 they are more centrally located than those in the first list. Germany (DEU) and France (FRA), on the other hand, are missing from the second list mainly because they are not directly connected to any of those three big economies. A much more realistic model of the world trade network would of course use weighted and directed links representing actual imports and exports, in addition to node weights, so one cannot attach much real-world importance to the above exemplary results.

In the preceding type of random walk betweenness, each individual visit of the walk to $v$ is counted. Newman [43] introduced a similar measure in which, however, only the "net" flow of the walk along each link is considered. A more intuitive interpretation of his measure is the expected effective current $I_{ab}(v)$ passing through $v$ when the network is interpreted as an electric circuit with all links having unit conductance, and a unit current is sent from a random node $a$ to a random node $b$. This explains the definition of Newman's random walk betweenness

$$NB_v = \langle I_{ab}(v) \rangle_{ab} \quad\text{with}\quad I_{ab}(i) = \tfrac{1}{2} \sum_{j \in \mathcal{N}_i} |V_i(a, b) - V_j(a, b)|, \tag{31}$$

where the "electric potential" vector $V(a, b)$ is given by Kirchhoff's equations

$$\Lambda V(a, b) = \delta(a) - \delta(b) \quad\text{with}\quad \delta_i(c) = \delta_{ic}, \tag{32}$$



where $\Lambda = \mathrm{diag}(k) - A$ is the Laplacian matrix of $G$ which will be studied more closely in Sec. 6.2. As with the stopping condition above, here the main problem in finding an n. s. i. version is that the current leaves the circuit at a single node $b$, so that when $b$ is split into $b' + b''$ and the current leaves at $b'$, the twin $b''$ will have a different electric potential than $b'$. This can be overcome in the same way as above, by using some n. s. i. similarity measure $\sigma^*$ and letting a part of the current that is proportional to $w_i \sigma^*(b, i)$ leave the circuit also at each neighbour $i$ of $b$. A similar thing must also be done at the $a$-side where the current enters. Finally, each link $i - j$ must get conductance $w_i w_j$ to reflect the fact that the link $i - j$ represents a bundle of links in $G_0$ between the $w_i$ many points represented by $i$ and the $w_j$ many points represented by $j$. In order to avoid a dominating influence of the degree on our measure, we also restrict $a$ and $b$ to nodes not directly linked to $v$, although this would not be necessary to achieve node splitting invariance. For the choice
(7* = Oyji in which the current enters and leaves at all nodes in 
the neighbourhood of a and b to some extent, this approach is 



depicted in Fig. 12 
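The unweighted circuit interpretation can be sketched with a pseudo-inverse of the Laplacian, which picks one of the potential vectors that solve Kirchhoff's equations (potentials are only fixed up to an additive constant, and only potential differences enter the current). The code below follows Eq. (31) literally, assumes a connected graph, and averages over all ordered pairs $(a, b)$ including $a = b$, which contribute zero; the function name is hypothetical.

```python
import numpy as np

def newman_betweenness(A):
    """Unweighted Newman-type betweenness for all v, Eqs. (31)-(32)."""
    A = np.asarray(A, dtype=float)
    n = len(A)
    L = np.diag(A.sum(axis=1)) - A            # Laplacian diag(k) - A
    L_pinv = np.linalg.pinv(L)
    nb = np.zeros(n)
    for a in range(n):
        for b in range(n):
            V = L_pinv[:, a] - L_pinv[:, b]   # solves L V = delta(a) - delta(b)
            diff = np.abs(V[:, None] - V[None, :])
            nb += 0.5 * (A * diff).sum(axis=1)  # half the absolute link currents
    return nb / n**2

# Usage: path 0 - 1 - 2; the middle node carries the most current.
nb = newman_betweenness(np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]]))
```

With this literal reading, the source and target nodes of each pair are credited with half a unit of current.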

All these considerations lead to our definition of n. s. i. Newman-type random walk betweenness

$$NB^*_v = \langle (1 - a^+_{av})\, I^*_{ab}(v)\, (1 - a^+_{vb}) \rangle^w_{ab} \quad\text{with}\quad I^*_{ab}(i) = \tfrac{1}{2} \sum_{j \in \mathcal{N}_i^+} w_j |V^*_i(a, b) - V^*_j(a, b)|, \tag{33}$$

where $V^*(a, b)$ is given by the new equations

$$D_w \Lambda^* V^*(a, b) = \delta'(a) - \delta'(b), \qquad \delta'_i(c) = \frac{w_i \sigma^*(i, c)}{\sum_{j \in \mathcal{N}_c^+} w_j \sigma^*(j, c)}, \tag{34}$$

and where

$$D_w = \mathrm{diag}(w) \quad\text{and}\quad \Lambda^* = \mathrm{diag}(k^*) - A^+ D_w \tag{35}$$

are the diagonal matrix of aggregation weights and the n. s. i. Laplacian matrix (see Sec. 6.2). In this way, after a split $s \to s' + s''$, all potentials $V^*_i(a, b)$ with $i \neq s$ remain unchanged, and $V^*_{s'}(a, b)$ and $V^*_{s''}(a, b)$ equal the former $V^*_s(a, b)$. The measure $NB^*_v$ estimates the expected effective current passing through each unit-sized part of the region of $G_0$ that is represented by $v$, when the domain of interest $G_0$ is interpreted as an electric circuit with all links having unit conductance, and a unit current is sent between two random points of $G_0$.

Fig. 12. Interpretation of n. s. i. Newman-type random walk betweenness as the expected (electric) current flowing through $v$ when a unit current flows between the neighbourhoods of two randomly selected nodes $a$ and $b$ and each link's conductance is proportional to both its ends' weights (here represented by line thickness).

In the brain example network, this time the top ten lists according to the
unweighted and n. s. i. versions of Newman's betweenness measure differ more
than for Arenas-type betweenness: We get TPOsup.R, LING.L, LING.R, ORBinf.R,
MFG.R, MFG.L, TPOmid.L, CAU.R, STG.L, and ITG.R according to $NB_v$, but
TPOmid.L, TPOsup.R, TPOsup.L, LING.L, ORBinf.R, TPOmid.R, CAU.R, THA.L,
STG.L, and THA.R according to $NB^*_v$. For example, MFG.L/R is again
missing in the second list while THA.L/R is included, and the explanation is
the same as for the case of shortest path betweenness (Sec. 5.3).

Newman-type random walk betweenness is also of interest in climate networks,
see Sec. 7.



6 Global measures 
6.1 Aggregate measures 

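Popular aggregate network statistics include the global clustering
coefficient $C = \langle C_v \rangle_v \in [0,1]$ with n. s. i. version

$$C^* = \langle C^*_v \rangle^w_v \in [0,1], \qquad (36)$$

the global transitivity
$T = \langle a_{vi} a_{ij} a_{jv} \rangle_{vij} / \langle a_{vi} (1 - \delta_{ij}) a_{jv} \rangle_{vij}$,
which is closely related to $C$ and has the n. s. i. version

$$T^* = \frac{\langle a^+_{vi} a^+_{ij} a^+_{jv} \rangle^w_{vij}}
            {\langle a^+_{vi} a^+_{jv} \rangle^w_{vij}} \in [0,1], \qquad (37)$$

the link density $\rho = \langle k_v \rangle_v / (N - 1) \in [0,1]$, whose
n. s. i. version is

$$\rho^* = \langle k^*_v \rangle^w_v / W \in [0,1], \qquad (38)$$

the average (geodesic) path length or mean geodesic distance
$L = \langle d_{ij} \rangle_{ij} \geq 0$ with

$$L^* = \langle d^*_{ij} \rangle^w_{ij} \geq 0, \qquad (39)$$

and the global efficiency [40] $E = \langle 1/d_{ij} \rangle_{i \neq j} \in [0,1]$,
for which we can use

$$E^* = \langle 1/d^*_{ij} \rangle^w_{ij} \in [0,1]. \qquad (40)$$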



Many authors, e. g., compare $C$ and $L$ to their values for randomly
rewired graphs to assess the "small-worldness" of the network in a way that
is quite sensitive to small structural changes in the network (see [44]),
and using $C^*$ and $L^*$ for this task should at least reduce that part of
this non-robustness that is related to node selection and aggregation.

In our internet example network, we get $C = 0.21$, i. e., for a randomly
chosen AS $v$, two randomly chosen other AS's that are both linked to $v$ are
also linked with each other with probability 21%. If the AS network is
constructed more meticulously than here, combining several data sources,
somewhat larger values around 0.4 are found [45]. On the other hand, we have
$C^* = 0.8$, meaning that for a randomly chosen IP address $x$, two randomly
chosen other IP addresses $y, z$ that can be reached from $x$ in at most one
routing step can also reach each other in at most one routing step with 80%
probability. One reason for the large difference between $C$ and $C^*$ is the
low average degree of only $\langle k_v \rangle_v = 4.24$, which means that
many of the above pairs $y, z$ belong to the same AS and therefore contribute
to $C^*$ but not to $C$. Also, we get $E \approx 0.27$ and
$E^* \approx 0.336$, which can be interpreted as indicating that a typical
distance between two randomly chosen AS's is $1/0.27 \approx 3.7$ routing
steps, while between two randomly chosen IP addresses it is only
$1/0.336 \approx 2.98$, which is 20% less, showing that large AS's tend to be
more central than small ones.
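
The node splitting invariance of these aggregate statistics can be checked
numerically. The following is a minimal sketch rather than the authors'
implementation: it assumes the n. s. i. degree $k^*_v = \sum_i a^+_{vi} w_i$
and the n. s. i. local clustering coefficient
$C^*_v = \sum_{ij} a^+_{vi} w_i a^+_{ij} w_j a^+_{jv} / (k^*_v)^2$ from the
earlier sections, and the function names `nsi_aggregates` and `split_node`
are hypothetical.

```python
import numpy as np

def nsi_aggregates(A, w):
    """Return (rho*, C*, T*) for a symmetric 0/1 adjacency matrix A with
    zero diagonal and positive node weights w."""
    A = np.asarray(A, dtype=float)
    w = np.asarray(w, dtype=float)
    W = w.sum()
    Ap = A + np.eye(len(w))            # extended adjacency A+ (each node linked to itself)
    k_star = Ap @ w                    # n.s.i. degree k*_v
    rho_star = (w @ k_star) / W**2     # <k*_v>^w_v / W
    # weighted closed-triple sum per node: sum_ij a+_vi w_i a+_ij w_j a+_jv
    tri = np.einsum("vi,i,ij,j,jv->v", Ap, w, Ap, w, Ap)
    C_star = (w @ (tri / k_star**2)) / W       # <C*_v>^w_v
    T_star = (w @ tri) / (w @ k_star**2)       # ratio of the two weighted averages
    return rho_star, C_star, T_star

def split_node(A, w, s, share=0.5):
    """Split node s into two linked twins that share its weight and links."""
    A = np.asarray(A, dtype=float)
    w = np.asarray(w, dtype=float)
    n = len(w)
    A2 = np.zeros((n + 1, n + 1))
    A2[:n, :n] = A
    A2[n, :n] = A[s]                   # the new twin inherits all links of s
    A2[:n, n] = A[:, s]
    A2[s, n] = A2[n, s] = 1.0          # and is linked to s itself
    w2 = np.append(w, (1.0 - share) * w[s])
    w2[s] = share * w[s]
    return A2, w2
```

Calling `nsi_aggregates` before and after `split_node` on a small test graph
should give the same three values up to rounding, whereas the unweighted
$\rho$, $C$, and $T$ generally change under such a split.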



6.2 Characteristic matrices and spectral measures 

Spectral network analysis deals with the eigenvalues and eigenspaces of
characteristic matrices such as the adjacency matrix $A$ or the Laplacian and
normal matrices

$$\Lambda = \mathrm{diag}(k) - A, \qquad T = \mathrm{diag}(k)^{-1} A, \qquad (41)$$

of which we will only consider the first two here. Spectral analysis can be
used to study the centrality of nodes and the community structure of the
network, and to find natural partitions or node classification trees.

Depending on which matrix properties are considered essential, there are at
least two different ways in which these matrices can be made n. s. i. in some
sense. One way is to multiply both the rows and columns with the square root
of the corresponding node weights, similar to what is done in a different
context in the estimation of empirical orthogonal functions from gridded
data (e. g., p4)), thereby preserving the symmetry of the matrix. For the
adjacency matrix, this gives

$$A^{*\prime} = D_w^{1/2} A^+ D_w^{1/2}, \qquad (42)$$



which has the property that any solution to the eigenequation
$A^{*\prime} x = \lambda x$ is n. s. i. in the sense that after a node
splitting $s \to s' + s''$, the resulting new $A^{*\prime}$ still has the
eigenvalue $\lambda$ with some new eigenvector $y$, the entries $x_v$ of the
eigenvector $x$ that belong to non-split nodes $v \neq s$ remain unchanged,
$y_v = x_v$, and the quotient of eigenvector entry and square root of the
weight is invariant also for the split nodes,
$y_{s'}/\sqrt{w_{s'}} = y_{s''}/\sqrt{w_{s''}} = x_s/\sqrt{w_s}$. In
particular, $y$ has the same $\ell_2$-norm as $x$. In a sense, the splitting
of node $s$ results in a corresponding "splitting" of all the eigenvectors'
$s$-dimension.

After the split, $A^{*\prime}$ has an $(N+1)$st eigenvalue of zero,
corresponding to the eigenvector $z$ with $z_{s'} = \sqrt{w_{s''}}$,
$z_{s''} = -\sqrt{w_{s'}}$, and $z_v = 0$ for $v \neq s', s''$. In particular,
the $m$-th moment $\frac{1}{N} \sum_i \lambda_i^m$ of $A^{*\prime}$ becomes
n. s. i. if the normalization $1/N$ is replaced by the n. s. i. normalization
$1/W$: $\frac{1}{W} \sum_i \lambda_i^m$. (Since $i$ does not refer to a node
in this sum, we do not need to weight $\lambda_i^m$ with $w_i$. The sum itself
is already n. s. i. since all non-zero $\lambda_i$ are.)

An alternative and simpler possibility is to use the node weights only for
the columns and put

$$A^* = A^+ D_w, \qquad (43)$$

destroying symmetry and getting different eigenvectors $y$ than with
$A^{*\prime}$, although the eigenvalues are the same. The entries of the
eigenvector $x$ that belong to non-split nodes $i \neq s$ still remain
unchanged, but this time $s'$ and $s''$ directly inherit their eigenvector
entries from $s$, $y_{s'} = y_{s''} = x_s$, so the new eigenvector $y$ has the
same $\ell_\infty$-norm as $x$.

As both $A^*$ and $A^{*\prime}$ are non-negative, the Perron-Frobenius
theorem guarantees the existence of a non-negative eigenvalue of largest
absolute value and corresponding non-negative real eigenvectors $x^*$ and
$x^{*\prime}$. Both can therefore be used to get the n. s. i. version of
another popular centrality measure: The (adjacency) eigenvector centrality
$EC_v$ of $v$ is the entry $x_v$ of the (non-negative) eigenvector $x$ for
$A$'s largest eigenvalue, where $x$ is usually normalized so that
$\max_v x_v = 1$. The above shows that we can define n. s. i. eigenvector
centrality as $EC^*_v = x^*_v = x^{*\prime}_v / \sqrt{w_v}$. A similar measure
based on the normal matrix $T$ is closely related to a major web search
engine's page rank measure.
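
The spectral statements above can be illustrated with a short NumPy sketch.
It is only an illustration under stated assumptions: the network is taken to
be connected (so that the leading eigenvector is unique up to sign), the
symmetric matrix $A^{*\prime} = D_w^{1/2} A^+ D_w^{1/2}$ is used, and the
function names `nsi_eigenvector_centrality` and `split_node` are hypothetical.

```python
import numpy as np

def nsi_eigenvector_centrality(A, w):
    """Return the eigenvalues of A*' and the n.s.i. eigenvector centrality,
    normalized to a maximum of one. Assumes a connected network."""
    A = np.asarray(A, dtype=float)
    w = np.asarray(w, dtype=float)
    sq = np.sqrt(w)
    A_sym = sq[:, None] * (A + np.eye(len(w))) * sq[None, :]   # D^(1/2) A+ D^(1/2)
    eigvals, eigvecs = np.linalg.eigh(A_sym)    # eigenvalues in ascending order
    x = eigvecs[:, -1]                          # eigenvector of the largest eigenvalue
    x = x * np.sign(x.sum())                    # make the Perron vector non-negative
    ec = x / sq                                 # EC*_v = x*'_v / sqrt(w_v)
    return eigvals, ec / ec.max()

def split_node(A, w, s, share=0.5):
    """Split node s into two linked twins that share its weight and links."""
    A = np.asarray(A, dtype=float)
    w = np.asarray(w, dtype=float)
    n = len(w)
    A2 = np.zeros((n + 1, n + 1))
    A2[:n, :n] = A
    A2[n, :n] = A[s]
    A2[:n, n] = A[:, s]
    A2[s, n] = A2[n, s] = 1.0
    w2 = np.append(w, (1.0 - share) * w[s])
    w2[s] = share * w[s]
    return A2, w2
```

After a split, the eigenvalue list should contain the old eigenvalues plus one
additional zero, and both twins should inherit the centrality of the node
they came from.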

In the brain example network, the top ten central nodes according to $EC_v$
are PCUN.L, LING.R, CAL.L, PCUN.R, DCG.L, DCG.R, MTG.R, SFGmed.L, CUN.L, and
PCG.R, and according to $EC^*_v$ they are PCUN.L, MTG.R, MFG.L, SFGmed.L,
PCUN.R, DCG.L, LING.R, DCG.R, CAL.L, and MFG.R. Curiously, this time the
second rather than the first ranking includes MFG.L/R, which in this case is
probably just because they have a much larger weight than CUN.L and PCG.R
(which are only in the first ranking).

For the Laplacian matrix $\Lambda$, also both constructions are possible,
leading to

$$\Lambda^* = \mathrm{diag}(k^*) - A^* \quad\text{and}\quad
\Lambda^{*\prime} = \mathrm{diag}(k^*) - A^{*\prime}. \qquad (44)$$

While the row sums of $\Lambda^*$ are zero just like for $\Lambda$, those of
$\Lambda^{*\prime}$ need not equal zero, but the latter matrix is symmetric.
Still, both matrices have the same eigenvalues, and those and their
eigenvectors behave in the same way under node splitting as those of $A^*$
and $A^{*\prime}$, except that the additional $(N+1)$st eigenvalue is now
$k^*_s$ instead of zero.

The non-symmetric version $\Lambda^*$ can also be interpreted as a Laplacian
matrix of a directed network in which the links instead of the nodes are
weighted, with link weights $w_{ij} = w_j$. If the network is the result of
the spectral coarse graining procedure described in [46], $\Lambda^*$ equals
the Laplacian derived in [46, Eq. 3].

The spectral bisection method of Fiedler [47] uses the signs of the
eigenvector of the smallest positive eigenvalue of $\Lambda$ to find the most
distinguishable two groups of nodes. Using either $\Lambda^*$ or
$\Lambda^{*\prime}$ in the same way will provide the n. s. i. spectral
bisection, both giving the same result since only the sign of the eigenvector
entries matters.
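
A possible implementation of the n. s. i. spectral bisection is sketched
below. It is an illustration only: it uses the symmetric variant
$\Lambda^{*\prime} = \mathrm{diag}(k^*) - D_w^{1/2} A^+ D_w^{1/2}$ because a
symmetric eigensolver can then be applied, the tolerance `tol` used to decide
which eigenvalues count as positive is an assumption, and the function name
`nsi_spectral_bisection` is hypothetical.

```python
import numpy as np

def nsi_spectral_bisection(A, w, tol=1e-9):
    """Return the smallest positive eigenvalue of Lambda*' and a boolean
    group label per node, taken from the signs of its eigenvector."""
    A = np.asarray(A, dtype=float)
    w = np.asarray(w, dtype=float)
    sq = np.sqrt(w)
    Ap = A + np.eye(len(w))                     # extended adjacency A+
    k_star = Ap @ w                             # n.s.i. degrees
    L_sym = np.diag(k_star) - sq[:, None] * Ap * sq[None, :]
    eigvals, eigvecs = np.linalg.eigh(L_sym)    # ascending eigenvalues
    idx = int(np.argmax(eigvals > tol))         # first eigenvalue clearly above zero
    return eigvals[idx], eigvecs[:, idx] > 0
```

Because the overall sign of an eigenvector is arbitrary, the two group labels
may be swapped between runs or between the original and a split network; only
the grouping itself is meaningful.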

An enhanced version of spectral bisection uses eigenvectors of Newman's [24]
generalized modularity matrix to iteratively find communities. Fig. 2 shows
three-group solutions found by the unweighted and n. s. i. versions of that
algorithm, as described in Appendix B (online only), in the world trade
example network. The unweighted version and the n. s. i. version using GDP as
node weight plausibly place Europe and North Africa in one group, most of
Asia in another, and the Americas in the third, with only small differences
in the placement of some "bordering" countries. (The placement of UGD (bottom
left) in the blue group is an artifact of the algorithm due to its large
distance from the network's centre.) When using population as node weight,
however, the n. s. i. result differs considerably, placing China and India in
different groups, since the algorithm tends to produce groups of
approximately equal weight.



7 Application to climate networks 

Coming back to our original field of application, let us finally compare the
unweighted and n. s. i. versions of a number of network measures in the case
of a climate network whose node set is a latitude-longitude-regular grid on
the Earth's surface, with a resolution of 2.5° in latitude and 3.75° in
longitude. As in Donges et al. [11,31,33], two of these 6,816 nodes (the two
poles have been excluded) were linked when the corresponding time-series of
monthly averaged surface air temperature (SAT) anomalies from the 20th
century reference run 20c3m of the Hadley Centre's HadCM3 model (as defined
in the IPCC's Fourth Assessment Report, see [31] for details) showed a
significant Pearson correlation coefficient. We chose the link inclusion
threshold so as to achieve a relatively high link density of approx. 0.05,
for which all correlations of absolute value of at least 0.25 were considered
as significant.

As can be seen, e. g., from the resulting n. s. i. Newman-type random walk
betweenness (Fig. 13), this network retains a number of interesting features
of the global climate system. As argued in [11], increased values of Newman's
random walk betweenness may be indicative of diffusive transport processes in
the climate system, e. g., turbulent eddy diffusion in the atmosphere and
ocean, whereas shortest path betweenness is believed to trace advective
transport processes such as strong surface ocean currents.

16 Jobst Heitzig et al.: Node-weighted measures for complex networks with spatially embedded, sampled, or differently sized nodes

For comparison, we defined a second, synthetic network on the same set of
nodes in which we linked each pair of nodes $i, j$ independently with a
probability of $\min(1, \exp(0.4 - 0.09\, \alpha_{ij}))$, where $\alpha_{ij}$
is the angular distance between the nodes (in degrees). The exponential
relationship between link probability and distance was fitted to the
relationship between the observed link density and angular distance in the
above climate network, using non-linear regression (a similar relationship
was found in [27]). The resulting benchmark network had a slightly smaller
link density of approx. 0.035 and can be interpreted as a sample from an
underlying continuous network whose link distribution depends on angular
distance alone and is therefore rotationally and translationally symmetric.
Because of this underlying symmetry, local network measures suitable for the
estimation of underlying features should not show a significant dependency on
the node's latitude. The thin dashed and dotted lines in Fig. 4 show the
longitudinally averaged values of several network measures, plotted against
latitude, in this benchmark network. We can see that the n. s. i. versions
(dotted lines, using cos(latitude) as node weight) fulfil this requirement of
latitude independence much better than the unweighted versions (thin dashed
lines), which exhibit a clear systematic increase either towards the poles
(degree, clustering coefficient, and closeness centrality) or towards the
equator (Newman-type random walk betweenness). For the degree, this is due to
the increase in absolute node density, while for the clustering coefficient,
this is due to the increase of the density gradient towards the pole.
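
The construction of such a benchmark network can be sketched as follows. This
is an illustration, not the authors' code: it uses the stated link
probability $\min(1, \exp(0.4 - 0.09\,\alpha_{ij}))$ and cos(latitude)
weights, but a much coarser default grid (10° by 15°) to keep the dense
distance matrix small, and it assumes that the grid latitudes run from
$-90° + \Delta$ to $90° - \Delta$. The function name `benchmark_network` and
the seed handling are hypothetical.

```python
import numpy as np

def benchmark_network(dlat=10.0, dlon=15.0, seed=0):
    """Sample a distance-dependent random network on a regular lat-lon grid.
    Returns node latitudes (degrees), adjacency matrix A and node weights w."""
    rng = np.random.default_rng(seed)
    lats = np.arange(-90.0 + dlat, 90.0, dlat)          # the two poles are excluded
    lons = np.arange(0.0, 360.0, dlon)
    lat, lon = (g.ravel() for g in np.meshgrid(lats, lons, indexing="ij"))
    phi, lam = np.radians(lat), np.radians(lon)
    # unit vectors on the sphere, used to get angular distances in degrees
    xyz = np.stack([np.cos(phi) * np.cos(lam),
                    np.cos(phi) * np.sin(lam),
                    np.sin(phi)], axis=1)
    alpha = np.degrees(np.arccos(np.clip(xyz @ xyz.T, -1.0, 1.0)))
    p = np.minimum(1.0, np.exp(0.4 - 0.09 * alpha))      # link probability
    upper = np.triu(rng.random(p.shape) < p, k=1)        # draw each pair once
    A = (upper | upper.T).astype(float)                  # symmetric, no self-loops
    w = np.cos(phi)                                      # node weight cos(latitude)
    return lat, A, w

lat, A, w = benchmark_network()
k = A.sum(axis=1)                          # unweighted degree
k_star = (A + np.eye(len(w))) @ w          # n.s.i. degree
for band in np.unique(lat):
    sel = lat == band
    print(f"{band:6.1f}  k = {k[sel].mean():7.2f}  k* = {k_star[sel].mean():7.3f}")
```

With `dlat=2.5` and `dlon=3.75` the same function produces the 6,816-node
grid described in the text, at the cost of a dense matrix with roughly 46
million entries.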

As can be expected from this, the corresponding values in 
the real-world climate network also show differences between 
the unweighted version (solid lines) and the n. s. i. version 
(thick dashed lines) towards the poles or towards the equator. 
Unweighted Newman-type random walk betweenness, e. g., is 
higher in the region of the North and South Equatorial Currents 
at about ±10° latitude, while its n. s. i. version is higher in the 
region of the North Atlantic Subtropical Gyre between +15° 
and +60° latitude, although both show well-defined features in 
both regions. 

The influence of the increased node density in the high latitudes on
unweighted network statistics becomes even more evident when focussing on the
Arctic region north of +60° (Fig. 3). While the unweighted degree and
clustering coefficient are markedly increased close to the North Pole, their
geographic distribution is considerably obscured by the node density induced
bias further southwards (Fig. 3 (A, B)). In contrast, the n. s. i. variants of
degree and clustering coefficient reveal more pronounced regional structures,
e. g., increased n. s. i. degree over southern Greenland and Scandinavia, or
increased n. s. i. clustering coefficient surrounding Greenland
(Fig. 3 (C, D)). Hence, we judge n. s. i. network measures to be very promising
tools for future analysis of gridded climate data with inhomogeneous mesh
cell areas or station data, particularly since the additional error that
would be introduced by interpolation to equal area (geodesic) grids can thus
be avoided.







[Figure 13: global map; colour scale "n.s.i. Newman random walk betweenness", ticks from .0002 to .0022]

Fig. 13. (Colour online) N. s. i. version $NB^*_v$ of Newman's random
walk betweenness in a global climate network representing correlations
in surface air temperature dynamics (same network as in Fig. 4,
Robinson projection). We can clearly identify the regions of the North
Pacific Subpolar Gyre, the North Atlantic Subtropical Gyre including
the Gulf Stream and the Canary Current, the North and South Equatorial
Currents in the Pacific, and the Antarctic Circumpolar Current.
The interpretation of other regions of high values like Scandinavia and
Central and North-East Africa remains unclear.

8 Conclusion 

To summarize, in this article we have introduced a fairly gen- 
eral framework to deal with biases and artifacts in complex net- 
work statistics that appear when the nodes represent differently 
sized parts of an underlying domain of interest. Networks of 
this type are routinely studied in various fields of research, in- 
cluding neuroscience, the Earth sciences, informatics and en- 
gineering, social sciences, economics, and dynamical systems, 
as our examples show, and network design choices made when 
describing the domain of interest by a complex network can 
have considerable effects on the results of network statistics. 

The central axiomatic notion of node splitting (or twin 
merging) invariance provides an elegant means to tackle the 
problem of how to use information on the size of (the part of 
the domain of interest represented by) individual nodes in a 
way that is robust against different choices of node selection, 
grid, meshing, parcellation, aggregation, or coarse-graining. 
Our framework allowed us to derive consistently weighted ver- 
sions of a large representative set of commonly used statistical 
network measures that quantify different aspects of networks 
in which links represent some kind of similarity or closeness. 
Despite the diversity of these n. s. i. measures, a simple set of
design rules (given in Sec. 4.3) guided the introduction of node
weights into their definitions. The resulting formulas are in
most cases computationally no more demanding than the origi- 
nal ones, and slightly modified versions of standard algorithms 
with the same complexity can usually be used to apply them 
in practice. Also, since the construction of all our measures is 
based on the same first principles, they work better together 
and allow for an easier interpretation than alternative ad hoc 
approaches might. 

Most importantly, n. s. i. measures reflect the features of the domain of
interest more accurately than classical unweighted measures. This was
demonstrated in particular in the case of a synthetic and a real-world
climate network, for which it was possible to compare the results of n. s. i.
measures on a non-homogeneous grid with those based on a homogeneous
(geodesic) grid, both showing exactly the same features and avoiding the
artifacts that were produced by unweighted measures on the non-homogeneous
grid. We have further illustrated the applicability and practical relevance
of our axiomatic approach by qualitatively showing its effects in a number of
semi-realistic example networks. In particular, we showed how in many of
these examples, the judgement of which parts of the network are the most
central or otherwise structurally important can change when node weights are
used.

Our results also indicate that the topological properties of 
network representations of technical infrastructure such as the 
internet depend on whether the sometimes considerably vary- 
ing size of the individual subsystems chosen as nodes in the 
network representation is taken into account or not, and we 
conjecture that n. s. i. measures will prove highly relevant and 
beneficial for the consistent analysis of the vulnerability of distributed
technical systems. When analyzing how the connectivity of the internet
decreases due to targeted attacks [48], the size of both the attacked and the
affected autonomous systems should be of obvious interest. A more thorough
study of the AS network used for illustration here is of course beyond the
scope of this methodological paper. It would have to construct the links much
more carefully by taking additional data sources into account, as described
in [14], and verify that the
number of CIDR prefixes is indeed a suitable size measure for 
autonomous systems. We also leave for future research the gen- 
eral question of how one should choose node weights suitable 
for a given network and research question. 

Appendix A: Measures corrected for a typical weight



Since we introduced n. s. i. measures $f^*$ as weighted versions of existing
network measures $f$, it is natural to consider the case where all node
weights $w_v$ are equal to some constant value $\omega$. Usually, because of
the construction mechanisms (i) and (ii) given in Sec. 4.3, $f^*$ will not
exactly equal $f$ in that case but rather be some (usually simple)
transformation of it, and only for $N \to \infty$, $f^*$ is often asymptotic
to $f$, e. g., $k^*_v = (k_v + 1)\,\omega = \omega k_v + o(k_v) \propto k_v$.
When one wants to compare values of $f^*$ with those of $f$, this behaviour
presents some difficulties, hence we will present here for most of the
treated measures a second, somewhat more complex, corrected n. s. i. version
of $f^*$, denoted by $f^{*\omega}$, which is also n. s. i., but which involves
a parameter $\omega > 0$ and has the property that

$$f^{*\omega} = f \quad\text{whenever } w_v = \omega \text{ for all } v \in \mathcal{N}, \qquad (45)$$

so that $f^{*\omega}$ can be compared to $f$ more easily than $f^*$. The
parameter $\omega$ is then called the typical weight and can be thought of as
a kind of resolution or scale on which the analysis focuses. The question of
how a suitable value for $\omega$ can be estimated from $G$ and $w$ if the
weights are not all equal is addressed later, in the next subsection.

In many cases, the following construction mechanisms are helpful in addition
to (i)-(iv) given in Sec. 4.3 to derive $f^{*\omega}$ from $f^*$:

(v) Correct for the effect of (i) by replacing each occurrence of $w_i$ by
$w_i/\omega$, and

(vi) correct for the effects of (ii) and (iii) by subtracting suitable terms
(often constants) from sums over nodes.

It will be convenient to express the corrections in terms of the corrected
version of $W$, which can be called the corrected n. s. i. number of nodes,

$$N^{*\omega} = W/\omega. \qquad (46)$$

In the example of degree, we apply both (v) and (vi) and put

$$k^{*\omega}_v = \sum_{i \in \mathcal{N}^+_v} w_i/\omega - 1 = k^*_v/\omega - 1, \qquad (47)$$

which obviously reduces to $k_v$ if $w_i \equiv \omega$. In case of
non-constant weights, it can happen that $k^{*\omega}_v$ turns out to be
negative for some $v$ if $\omega$ is chosen too large. The same effect can
happen for corrected n. s. i. versions of other measures, as will be obvious
from their definitions presented below. In general, negative values can be
avoided by lowering $\omega$ or by replacing them by zero.
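
As a small numerical illustration of the corrected degree, the following
sketch computes $k^*_v/\omega - 1$ and shows both the reduction to the
ordinary degree for constant weights and the possibility of negative values.
It assumes $k^*_v = \sum_i a^+_{vi} w_i$ as the n. s. i. degree; the function
name `corrected_nsi_degree` is hypothetical.

```python
import numpy as np

def corrected_nsi_degree(A, w, omega):
    """Corrected n.s.i. degree k*_v / omega - 1 for typical weight omega."""
    A = np.asarray(A, dtype=float)
    w = np.asarray(w, dtype=float)
    k_star = (A + np.eye(len(w))) @ w     # n.s.i. degree: weight of v and its neighbours
    return k_star / omega - 1.0

# path network 0 - 1 - 2 with one very heavy node
P = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
print(corrected_nsi_degree(P, [2.5, 2.5, 2.5], omega=2.5))   # equals the plain degrees
print(corrected_nsi_degree(P, [1.0, 1.0, 10.0], omega=10.0)) # node 0 becomes negative
```

In the second call the typical weight is set to the largest node weight, which
is too large for the light node 0, so its corrected degree drops below zero.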

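For the local clustering coefficient, we have

$$C^*_v(w_i \equiv \omega) = \frac{C_v k_v (k_v - 1) + 3 k_v + 1}{(k_v + 1)^2} = C_v + o(1).$$

A corrected n. s. i. local clustering coefficient can only be defined for the
case that $k^{*\omega}_v > 1$,

$$C^{*\omega}_v = \frac{(N^{*\omega})^2 \langle a^+_{vi} a^+_{ij} a^+_{jv} \rangle^w_{ij} - (3 k^{*\omega}_v + 1)}
                        {k^{*\omega}_v (k^{*\omega}_v - 1)} \leq 1, \qquad (48)$$

where, following (vi), we subtract $(3 k^{*\omega}_v + 1)$ since in
$\langle a^+_{vi} a^+_{ij} a^+_{jv} \rangle^w_{ij}$, the nodes $i$ and $j$ can
be equal or equal to $v$. For $\omega \leq w_v/2$, one can prove that
$C^{*\omega}_v \geq 0$.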
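
The size of the subtracted term can be checked by a direct count. The
following worked equation is a sketch under the assumption of constant
weights $w_i \equiv \omega$, with $\mathcal{N}^+_v = \mathcal{N}_v \cup \{v\}$
denoting the extended neighbourhood: the sum over ordered pairs $(i,j)$ in
$\mathcal{N}^+_v$ splits into linked pairs of distinct neighbours, pairs with
$i = j$, and pairs in which exactly one of the two nodes equals $v$.

```latex
\sum_{i,j} a^+_{vi}\, a^+_{ij}\, a^+_{jv}
  = \underbrace{C_v\, k_v (k_v - 1)}_{i \neq j,\; i,j \in \mathcal{N}_v}
  + \underbrace{(k_v + 1)}_{i = j}
  + \underbrace{2 k_v}_{\text{exactly one of } i, j \text{ equals } v}
  = C_v\, k_v (k_v - 1) + 3 k_v + 1 .
```

Subtracting $3 k_v + 1$ and dividing by $k_v (k_v - 1)$ therefore recovers
$C_v$, which is the behaviour required of a corrected measure for constant
weights.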

Also corrected n. s. i. closeness centrality measures can be derived via (v)
and (vi):

$$CC^{*\omega}_v = \frac{1}{1/CC^*_v - 1/N^{*\omega}} \leq 1, \qquad
CC'^{*\omega}_v = CC'^*_v - 1/N^{*\omega} \leq 1. \qquad (49)$$



In case of (shortest path) betweenness centrality
$BC^*_v = \langle n^*_{ab}(v)/n^*_{ab} \rangle^w_{ab}$, we only have to divide
$n^*_{ab}$ and $n^*_{ab}(v)$ by a suitable power of $\omega$, according to (v).
Subtractions as in (vi) are unnecessary since we did not extend any sums to
derive $BC^*_v$ from $BC_v$. Hence the corrected n. s. i. versions are



Co'-'-Kh, n:^iv) = CO^-'''^n:,iv), 
BCr - {n:n^)/n*M - ^BC: e [O,co/wy] 



(50) 



For random walk-based measures, a corrected n. s. i. version of the
transition probabilities $p^*_{ij}$ is only obvious in the pathological case
in which $\omega \leq \min_v w_v$ and when no isolated nodes exist. We could
then put

$$p^{*\omega}_{ij} = a^+_{ij} (w_j/\omega - \delta_{ij}) / k^{*\omega}_i. \qquad (51)$$

For larger, more realistic choices of typical weight, this definition of
$p^{*\omega}_{ij}$ would result in negative values for many $i$ and could thus
not be interpreted as a transition probability. We will therefore not present
corrected versions of random-walk based measures.

The aggregate statistics we presented also allow for corrected 
n. s. i. versions: For the global clustering coefficient it is 



while for transitivity it is 



{N*">Y{ay,aijaj,)r,j-N*'' {3{k*,'>)^- + 1) 



-A?*™(3(A:*'«)^' + 1) 



(52) 



^1. (53) 



For the link density, average (geodesic) path length, and global 
efficiency, it is just 



f^^Oi n* — 1 r*G) _ 1 Z7*ft) 1 



(54) 



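For spectral analysis, the corrected n. s. i. versions of adjacency $A$ and
Laplacian $\Lambda$ are

$$A^{*\omega} = \tfrac{1}{\omega} A^* - I, \qquad
\Lambda^{*\omega} = \mathrm{diag}(k^{*\omega}) - A^{*\omega} = \tfrac{1}{\omega} \Lambda^*,$$

$$A^{*\prime\omega} = \tfrac{1}{\omega} A^{*\prime} - I, \qquad
\Lambda^{*\prime\omega} = \mathrm{diag}(k^{*\omega}) - A^{*\prime\omega} = \tfrac{1}{\omega} \Lambda^{*\prime}. \qquad (55)$$

In particular, $EC^*_v(w_i \equiv \omega) = EC_v$, hence n. s. i. eigenvector
centrality needs no correction, and also unweighted and n. s. i. spectral
bisections are identical if $w_i \equiv \omega$.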
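
That the corrected Laplacian is just a rescaled n. s. i. Laplacian follows in
one line. The worked equation below is a sketch that reads the corrected
adjacency matrix as $A^{*\omega} = A^*/\omega - I$ and uses the corrected
degree $k^{*\omega}_v = k^*_v/\omega - 1$; the identity terms cancel.

```latex
\mathrm{diag}(k^{*\omega}) - A^{*\omega}
  = \bigl(\tfrac{1}{\omega}\,\mathrm{diag}(k^*) - I\bigr)
    - \bigl(\tfrac{1}{\omega} A^* - I\bigr)
  = \tfrac{1}{\omega}\bigl(\mathrm{diag}(k^*) - A^*\bigr)
  = \tfrac{1}{\omega}\,\Lambda^* .
```

Since a positive rescaling changes neither the eigenvectors nor the ordering
of the eigenvalues, the spectral bisection is unaffected by the choice of
$\omega$.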

Estimation of typical weight

The usefulness of our corrected versions of the n. s. i. measures depends on
a suitable choice of the typical weight $\omega$. In applications in which
the underlying domain of interest does not provide a natural choice for
$\omega$, we can try to determine a suitable $\omega$ from the network itself.
Such an "estimate" $\hat\omega$ of $\omega$ should ideally (i) depend
monotonically on the node weights $w_v$, (ii) lie in
$[\min_v w_v, \max_v w_v]$, (iii) be n. s. i. itself, and (iv) be small enough
so that measures such as $k^{*\omega}_v$ and $C^{*\omega}_v$ are defined and
non-negative for all or at least almost all nodes. In addition, it would be
nice if (v) $\hat\omega$ is statistically robust, i. e., cannot change
unboundedly by only local changes to the network.

To fulfil (i), (ii), and (v), the most natural choice seems to be the median
node weight. With a small adjustment, also (iii) is satisfied: Define the
n. s. i. twin-adjusted weight of $v$ as



(56) 



and let $\hat\omega_I$ be the $w$-weighted median of $w'^*_v$. This fulfils
(i)-(iii) and (v), though not necessarily (iv). Only when there are
pathologically many twins, $\hat\omega_I$ might exceed $\max_v w_v$.

A different approach is to address (iv) first by putting



_ 3/,* 
— 4^1, 



(57) 



Using $\omega = \min_v w''^*_v$ would then ensure that for all $v$,
$k^{*\omega}_v \geq 1$ and both $C^{*\omega}_v$ and $C'^{*\omega}_v$ are
defined and non-negative except maybe for those few $v$ where the above
minimum is attained. When $v$ is not isolated, $w''^*_v \geq \min_v w_v$, but
it may easily exceed $\max_v w_v$. Hence a good choice that fulfils (i)-(iv),
though not (v), is

$$\hat\omega_{II} = \min(\hat\omega_I, \min_v w''^*_v). \qquad (58)$$

Finally, a trade-off between (iv) and (v) can be made by using in this
definition not $\min_v w''^*_v$ but a small quantile, say the first
percentile $P^w_1(w''^*_v)$, of the $w$-weighted distribution of $w''^*_v$:

$$\hat\omega_{III} = \min(\hat\omega_I, P^w_1(w''^*_v)). \qquad (59)$$

Appendix B: Some additional measures

The average (nearest) neighbours' degree of $v$,

$$k_{nn,v} = \sum_{i \in \mathcal{N}_v} k_i / k_v, \qquad (60)$$

represents the average size of the region a point linked to $v$ is linked to.
Using the plug-in mechanism (iv) and the correction mechanisms (v) and (vi),
we define its weighted versions, the n. s. i. average (nearest) neighbours'
degree and the corrected n. s. i. average (nearest) neighbours' degree, as

$$k^*_{nn,v} = \sum_{i \in \mathcal{N}^+_v} w_i k^*_i / k^*_v, \qquad
k^{*\omega}_{nn,v} = \sum_{i \in \mathcal{N}^+_v} w_i k^{*\omega}_i / (\omega k^{*\omega}_v) - 1.$$

These are n. s. i. since their ingredients $k^*_i$, $k^*_v$, $k^{*\omega}_i$,
and $k^{*\omega}_v$ are. In case of constant weights $w_i \equiv \omega$, we
have $k^{*\omega}_{nn,v} = k_{nn,v}$, whereas the uncorrected version is then

$$k^*_{nn,v}(w_i \equiv \omega) = (k_{nn,v} + 2)\,\omega + O(1).$$



The Pearson product-moment correlation coefficient between 
the degrees of the two end nodes of each link in G is called 
degree correlation or assortativity and can be computed from 
the degrees and average neighbours' degrees as follows: 



{klUky)y-i{kl)y)^ 



(61) 



A [corrected] n. s. i. degree correlation is most easily found 
using the plug-in mechanism (iv): 

*_ Vv '^nn.v/v \'^v/v W^v Iv) 

/7,*(b2i,*(B \w Ii,*CO\w /'/j,*(u2\M'\2 

\ V '*-n«,v/v \'*-v Iv W^v Iv ) 



Note that r* is the common Pearson correlation coefficient be- 
tween k*^ and k* in the probabilistic model in which the link 
or self-loop / — j is drawn with relative probability WjWj. The 
latter measures the number of links in Go between the regions 
represented by / and j, hence r* estimates the correlation coef- 
ficient between the degrees of the two end points of all links in 
Go. 

Soffer and Vazquez [50] justify a version of the clustering coefficient
$C_v$ which is partially adjusted for degree correlations:



N {ciyiQ.ijCl ^'y) ij 

LG.4c(min(fe,-,^,,)-l) 



e[Cv,l] 



(62) 



Again using all four mechanisms (i)-(iv), we get their n. s. i. 
versions: 

W^{a+alfal,)Yi 
C* = ^ ll\.^ ^[^v'l] and 

(A^*«)2(«+4a+)-.-3^r-l 



"ie^v CO 



'^mm{k*'^,k*'^)-2k*c 



where we define the latter only for the case where 

Bonacich [51] defined a measure of $v$'s power centrality based on the idea
that a node's "power" in a network should be the sum of a linear function of
each of its neighbours' powers:

$$PC_v = \sum_{i \in \mathcal{N}} a_{vi} (\alpha + \beta\, PC_i) \qquad (63)$$

for some parameters $\alpha, \beta > 0$. This implicit definition is solved by
the power centrality vector

$$PC = (PC_i)_{i \in \mathcal{N}} = \alpha (I - \beta A)^{-1} k, \qquad (64)$$

where $k = (k_i)_{i \in \mathcal{N}}$ is the degree vector, assuming that
$I - \beta A$ is invertible (which will be true for a general choice of
$\beta$). To find the n. s. i. version, we make the defining equation n. s. i.
and require that $v$'s power is the weighted sum of the linear function of
its neighbours' powers:

$$PC^*_v = \sum_{i \in \mathcal{N}} a^+_{vi} w_i (\alpha + \beta\, PC^*_i). \qquad (65)$$

Whenever $I - \beta A^+ D_w$ is invertible, the solution

$$PC^* = \alpha (I - \beta A^*)^{-1} k^* \qquad (66)$$

is then the sought n. s. i. power centrality vector, where $k^*$ is the
n. s. i. degree vector, $k^* = (k^*_i)_{i \in \mathcal{N}}$.

Following (v) and (vi) again, the corrected n. s. i. equation is

$$PC^{*\omega}_v = \sum_{i \in \mathcal{N}} a^+_{vi} \frac{w_i}{\omega} (\alpha + \beta\, PC^{*\omega}_i) - (\alpha + \beta\, PC^{*\omega}_v), \qquad (67)$$

and its solution is the corrected n. s. i. power centrality vector

$$PC^{*\omega} = \alpha (I - \beta A^{*\omega})^{-1} k^{*\omega}. \qquad (68)$$
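
The n. s. i. power centrality can be computed by solving one linear system.
The sketch below is an illustration under stated assumptions, not the
authors' code: it takes $A^* = A^+ D_w$ and $k^* = A^* \mathbf{1}$, solves
$(I - \beta A^*)\, PC^* = \alpha k^*$ directly instead of forming the inverse,
and uses a small default $\beta$ so that the system is well conditioned for
small examples. The function name `nsi_power_centrality` and the default
parameter values are hypothetical.

```python
import numpy as np

def nsi_power_centrality(A, w, alpha=1.0, beta=0.05):
    """Solve (I - beta A*) PC* = alpha k* with A* = A+ D_w and k* = A* 1."""
    A = np.asarray(A, dtype=float)
    w = np.asarray(w, dtype=float)
    n = len(w)
    A_star = (A + np.eye(n)) * w[None, :]      # scale column j by w_j
    k_star = A_star.sum(axis=1)                # n.s.i. degree vector
    return alpha * np.linalg.solve(np.eye(n) - beta * A_star, k_star)

A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
w = np.array([1.0, 2.0, 3.0, 4.0])
pc = nsi_power_centrality(A, w)
print(pc)
```

If a node is split into two linked twins that share its weight and links,
both twins should receive the power centrality of the original node, and all
other entries should stay the same.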

The random walk centrality of $v$ as defined by Noh and Rieger [52] measures
how fast a walk starting at a random node reaches $v$:

$$RWC_v = p_v \Big/ \sum_{t=0}^{\infty} \bigl( (P^t)_{vv} - p_v \bigr). \qquad (69)$$

The obvious n. s. i. version of this is

$$RWC^*_v = p^*_v \Big/ \sum_{t=0}^{\infty} \bigl( ((P^*)^t)_{vv} - p^*_v \bigr).$$

The normal matrix $T = \mathrm{diag}(k)^{-1} A$ is also often used as the
basis of a centrality measure and has the n. s. i. versions

$$T^* = \mathrm{diag}(k^*)^{-1} A^*, \qquad
T^{*\omega} = \mathrm{diag}(k^{*\omega})^{-1} A^{*\omega}, \qquad (70)$$

which both retain $T$'s property of having row sums equal to one.



Similar to the matrices $A$, $\Lambda$, and $T$, also the spectrum of the
modularity matrix $B = (B_{ij})_{ij}$ can be made n. s. i., using the
matrices $B^+ = (B^+_{ij})_{ij}$ and $B^{+\omega} = (B^{+\omega}_{ij})_{ij}$:

B* = d'J^b+dIJ\ b*" = idJ/'b+^d;/' - I. 

Given a subset $g \subseteq \mathcal{N}$ of the nodes, the generalized
modularity matrix is the $|g| \times |g|$ matrix
$B^{(g)} = (B^{(g)}_{ij})_{ij}$ with

$$B^{(g)}_{ij} = B_{ij} - \delta_{ij} \sum_{k \in g} B_{ik}. \qquad (72)$$

Newman [24] uses the signs in the eigenvector of its largest positive
eigenvalue in an efficient iterative network dividing algorithm similar to
Fiedler's spectral bisection method, in which a hierarchical clustering tree
is constructed top-down by starting with $g_0 = \mathcal{N}$, bisecting into
two groups $g_1, g_2$ according to the eigenvector signs, and then repeating
with each $g_i$ thus obtained. The following versions of $B^{(g)}$ can be used
to derive similar n. s. i. network divisions, as exemplified in Fig. 2:

If we assume that when splitting $s \to s' + s''$ with $s \in g$, both $s'$
and $s''$ are put into the new $g$, then the eigenvectors of these matrices
have the same n. s. i. properties as those of $B^*$ above. In particular, the
eigenvector entries for $s'$ and $s''$ have the same sign as that of $s$,
whereas all other signs are unchanged, hence the division of $g$ will be the
same after the split except that $s$ is replaced by $s'$ and $s''$ in its
subgroup.



Modularity 

Following Newman f24l, the modularity of a partition of 
jy expresses how well the groups defined by 3^ are internally 
connected and separated from each other It is defined as the 
observed within-group link density minus its expected value 
given the observed degree distribution: 

„ _ {S^{i).^{j)Bij),j kjkj 

{^i ^---"--A^' ^''^ 

where ^(/) is that set in which contains /. The (corrected) 
n. s. i. versions of these are: 



Q* estimates the within-group link density in Go minus its ex- 
pected value given the degree distribution in Go, for the parti- 
tion of yii) induced by the partition ^ of 



